date:20091111

Re: [PATCH] opensm/osm_state_mgr.c: force heavy sweep when fabric consists of single switch

2009-11-11 Thread Eli Dorfman (Voltaire)

Yevgeny Kliteynik wrote:
> Eli Dorfman (Voltaire) wrote:
>> Yevgeny Kliteynik wrote:
>>> Eli Dorfman (Voltaire) wrote:
 Yevgeny Kliteynik wrote:
> Eli Dorfman (Voltaire) wrote:
>> Yevgeny Kliteynik wrote:
>>> Yevgeny Kliteynik wrote:
 Line Holen wrote:
> On 11/ 4/09 04:54 PM, Yevgeny Kliteynik wrote:
>> Line Holen wrote:
>>> On 11/ 4/09 10:47 AM, Yevgeny Kliteynik wrote:
 Sasha Khapyorsky wrote:
> On 12:26 Tue 03 Nov , Yevgeny Kliteynik wrote:
>> Always do heavy sweep when there is only one node in the
>> fabric, and this node is a switch, and SM runs on top of it -
>> there may be a race when OSM starts running before the
>> external ports are up, or if they went through
>> reset while SM was starting.
>> In this race switch brings up the ports and turns on the
>> PSC bit, but OSM might get PortInfo before SwitchInfo, and it
>> might see all ports as down, but PSC bit on. If that happens,
>> OSM turns off PSC bit, and it will never see external ports
>> again - it won't perform any heavy sweep, only light sweep
> Could such race happen when there are more than one node in a
> fabric?
 I think that my description of the race was misleading.
 The race can happen on *any* fabric when SM runs on switch.
 But when it does happen, SM thinks that the whole subnet
 is just one switch - that's what it managed to discover.
 I've actually seen it happening.
 So the patch fixes this particular case.

 So the next question that you would probably ask is can
 this race happen on some *other* switch and not the one
 SM is running on?

 Well, I don't know. I have a hunch that it can't, but I
 couldn't prove it to myself yet.

 The race on the managed switch is a special case because
 SM always sees port 0, and always gets responses to its
 SMP queries. On any other switch, if the ports were reset,
 SM won't get any response until the ports are up again.

 Perhaps there might be a case where SM got some port as down,
 and by the time SM got SwitchInfo with PSC bit the port
 was already up, so SM won't start discovery beyond this
 port. But this race would be fixed on the next heavy sweep,
 when SM will discover this port that it missed the previous
 time, whereas race on managed switch is fatal - SM won't
 ever do any heavy sweep.

 -- Yevgeny
>>> At least for the 3.2 branch there is a general race regardless of
>>> where the SM is running. I haven't checked the current master, but
>>> I cannot recall seeing any patches related to this so I assume
>>> the race is still there.
>>>
>>> There is a window between SM discovering a switch and clearing PSC
>>> for the same switch. The SM will not detect a state change on the
>>> switch ports during this time.
>> If the port changes state during that period, the switch issues
>> new trap 128, which (I think) should cause SM to re-discover the
>> fabric once this discovery cycle is over. Is this correct?
>>
> I think the switch shall send a trap whenever it sets the PSC bit.
> Once set I believe it will not send another trap until it is
> reset.
> Or do I misinterpret the spec ?
 I may be wrong, but I thought that this is how things work:
 - port state changes
 - switch turns on PSC bit and starts sending traps
 - SM gets the trap, sends trap repress
 - switch gets trap repress and stops sending traps
 - PSC is still on
 - port state changes again (the same or any other port)
 - switch turns on PSC bit (which doesn't matter as PSC is
   already on) and starts sending traps again
 - etc...

 Anyway, I'll double-check this issue.
>>> Yep, verified.
>>> Switch sends traps regardless of the PSC bit status.
>>> Also, the spec doesn't link them together:
>>>
>>>   o14-5.1.1: If a switch supports Traps
>>>   (PortInfo:CapabilityMask.IsTrapSupported is one), its SMA
>>>   shall send trap 128 to the SM indicated by the
>>>   PortInfo:MasterSMLID under any condition that would
>>>   cause SwitchInfo:PortStateChange to be set to one.
>>>   (See 14.2.5.4 SwitchInfo on page 827.)
>>>
>> Trap will be sent according to the SMLID. After first bring up the
>> SMLID is not set yet and trap will not be sent.
>> In th
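
The race and the workaround from the patch description at the top of this thread can be modelled in a few lines of C. This is only a sketch under stated assumptions: `struct sweep_view`, its fields and `needs_heavy_sweep()` are hypothetical names chosen for illustration, not the code in osm_state_mgr.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the SM knows after a light sweep. */
struct sweep_view {
	unsigned num_nodes;       /* nodes discovered so far */
	bool only_node_is_switch; /* the single known node is a switch */
	bool sm_runs_on_switch;   /* the SM is local to that switch */
	bool psc_bit_seen;        /* SwitchInfo:PortStateChange was set */
};

/* Decide whether the next sweep has to be a heavy one. */
static bool needs_heavy_sweep(const struct sweep_view *v)
{
	assert(v != NULL);

	/* Usual rule: a reported port state change forces rediscovery. */
	if (v->psc_bit_seen)
		return true;

	/*
	 * Workaround from the patch description: a lone managed switch
	 * may have had its PSC bit cleared while all ports looked down,
	 * so never settle for light sweeps in that situation.
	 */
	if (v->num_nodes == 1 && v->only_node_is_switch &&
	    v->sm_runs_on_switch)
		return true;

	return false;
}
```

With a view of `{1, true, true, false}` the function returns true even though no PSC bit was seen, which is the case the patch targets; a larger fabric without PSC keeps doing light sweeps.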

Re: strong ordering for data registered memory

2009-11-11 Thread Dave Olson

On Wed, 11 Nov 2009, Jason Gunthorpe wrote:

| On Wed, Nov 11, 2009 at 05:44:59PM -0500, Richard Frank wrote:
| 
| > Would anyone like to throw out the list of HCAs that do this... I
| > can guess at a few...  and can ask the vendors directly.. if not.. .
| > 
| > It would be much nicer to not hardcode names of adapters.. but that won't
| > stop us.. :)
| 
| Isn't it more complex than this? AFAIK the PCI-E standard does not
| specify the order in which data inside a single transfer becomes visible,
| only how different transfers relate. To work on the most aggressive
| PCI-E system the HCA would have to transfer the last XX bytes as a
| separate PCI-E transaction without relaxed ordering.

I can't speak to the specifics of this on PCIe, but yes, by default
the pcie transfers within a single tag can be unordered.

| This is the sort of thing that might start to matter on QPI and HT
| memory-interleaved configurations. A multi-cache line transfer will be
| split up and completed on different chips - it may not be fully
| coherent 100% of the time.

HT is fine, by design, at least on AMD processors (probably don't care
too much about the older sibyte cpus, since they weren't fully
HT-compliant).

I don't know about QPI.

Dave Olson
dave.ol...@qlogic.com
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Dave Olson

On Wed, 11 Nov 2009, Roland Dreier wrote:

| 
|  > While this is true for SLtoVL, we create other files which are
|  > device specific under the port directory too.
|  > It seems like we might need to introduce a callback into the driver to
|  > create the port specific sysfs files.
| 
| Umm, you could have said there were other things initially!

Those have been there "forever" in qib without requiring the change
in the core sysfs code.  It's only sysfs group entries that require
the patch to expose ib_port.

| Anyway, rather than a callback, I guess we could just add a place to
| attach a set of port attributes to the structure that gets passed into
| ib_register_device() maybe?

Seems like major overkill to have callbacks, when all we need is to
get the structure that "owns" (is the parent kobject of) the directory.

| And maybe we could clean up the existing code that does
| device_create_file() to use a list of device attributes also...

Seems to be a rather different issue, to me.

Dave Olson
dave.ol...@qlogic.com

Re: [PATCH] infiniband-diags/saquery.c: Change lids and port numbers to decimal

2009-11-11 Thread Sasha Khapyorsky

On 15:32 Fri 06 Nov , Hal Rosenstock wrote:
> 
> rather than hex
> 
> Signed-off-by: Hal Rosenstock 

Applied. Thanks.

Sasha

Re: [PATCH] opensm/osm_link_mgr.c: Fix IBA reference for PortInfo attribute

2009-11-11 Thread Sasha Khapyorsky

On 15:42 Fri 06 Nov , Hal Rosenstock wrote:
> 
> Signed-off-by: Hal Rosenstock 

Applied. Thanks.

Sasha

Re: [PATCH] opensm - remove useless goto

2009-11-11 Thread Sasha Khapyorsky

On 15:07 Mon 09 Nov , Stan C. Smith wrote:
> 
> Signed-off-by: stan smith 

Applied. Thanks.

Sasha

Re: [PATCH] opensm/osm_mcast_mgr.c: fix return value on alloc_mfts() failures

2009-11-11 Thread Sasha Khapyorsky

On 15:25 Wed 11 Nov , Hal Rosenstock wrote:
> 
> Yes, that fixes the return value but nothing (at least currently)
> takes advantage of that. Is that the next step ?

Yes, it should be done this way.

Sasha

RE: [PATCH] librdmacm/mckey: add notifications on events

2009-11-11 Thread Sean Hefty

>add notifications on multicast error and address change events which
>can take place while traffic is running.

mckey is intended to be a fairly simple send/receive multicast test program.
What's the reasoning behind adding the event handling?

- Sean


Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Jason Gunthorpe

On Wed, Nov 11, 2009 at 04:04:10PM -0800, Roland Dreier wrote:
> 
>  > Maybe give some thought to using a syscall interface through uverbs
>  > for some of this?
> 
> Actually I think for exposing SL-to-VL and other things like that, sysfs
> is pretty good.  Having something usable from both scripts and programs
> seems pretty useful, and having an opaque uverbs interface isn't really
> an improvement (especially when we have to design something extensible
> that device-specific stuff can be put into).

I guess it depends on the purpose; a noticeable problem with sysfs is
that there is no good way to be notified when the data changes. PKey,
SL2VL, GID tables, sm_lid etc are all SM dynamic information and many
cases that are using them should probably have code to know when the
SM changes them and make appropriate adjustments.

For instance a long running SMP using program has no way to be
notified when the sm_lid changes, or the GID table changes - but it
can pick up an IB async event for the pkey table changes.. What should
new things do?

It also means we can never have something like ifrename for IB - too
racy with sysfs.

>  > IMHO, sysfs is getting out of hand for rdma:
> 
> I'm not sure how much of a problem this really is...

Neither am I.. But I've seen the various eternal lkml arguments about
sysfs, netlink, syscall, etc and it does seem like the preferred
option is a little bit of all of them. It does seem worth asking from
time to time if the rdma stuff in sysfs is appropriate.

>  > $ find /sys/class/infiniband/mlx4_0 -type f | wc -l
>  > 660
> 
> and presumably 512 of those are gid and pkey table entries?

Probably. TBH, those are the ones I find most un-sysfs-like..

>  > $ strace -o /tmp/t /opt/ofa-1.5/sbin/perfquery ; grep sys/ /tmp/t | wc -l
>  > 289
> 
> That seems a little crazy, but maybe it's an app that's doing silly
> stuff?  If I do ibv_rc_pingpong, the only /sys related things I see are:

It is reading the pkey and gid tables for some reason. There is no
other way to get that data except by trundling through sysfs.. Which I
guess really is my point - it isn't so much that the stuff is in sysfs
that is strange, but that it is *only* in sysfs.

> open("/sys/class/infiniband_verbs/abi_version", O_RDONLY) = 3
> open("/sys/class/infiniband_verbs", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> stat("/sys/class/infiniband_verbs/abi_version", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
> stat("/sys/class/infiniband_verbs/uverbs0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
> open("/sys/class/infiniband_verbs/uverbs0/ibdev", O_RDONLY) = 4
> open("/sys/class/infiniband_verbs/uverbs0/abi_version", O_RDONLY) = 4
> open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 3
> open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 3
> open("/sys/class/infiniband/mlx4_0/node_type", O_RDONLY) = 3
> 
> which is reasonable I think.

Yes, I also think that is pretty much fine.

Jason
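
To illustrate the "trundling through sysfs" point: a program that wants a single P_Key has to build a path, open a file and parse text, and it has to repeat that whenever it wants to know whether the SM changed the table, because sysfs itself offers no change notification. A minimal sketch in C follows; the helper names `pkey_path()` and `read_pkey()` are hypothetical, and the path layout assumes the usual `ports/<port>/pkeys/<index>` hierarchy under `/sys/class/infiniband/<device>`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of one P_Key table entry.
 * Returns 0 on success, -1 if dev is empty or buf is too small.
 */
static int pkey_path(char *buf, size_t len, const char *dev,
		     unsigned port, unsigned idx)
{
	int n;

	assert(buf != NULL && dev != NULL);
	if (strlen(dev) == 0)
		return -1;
	n = snprintf(buf, len, "/sys/class/infiniband/%s/ports/%u/pkeys/%u",
		     dev, port, idx);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one P_Key value. Returns 0 on success, -1 on any failure. */
static int read_pkey(const char *dev, unsigned port, unsigned idx,
		     unsigned *pkey)
{
	char path[256];
	FILE *f;
	int ok;

	if (pkey_path(path, sizeof(path), dev, port, idx))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	/* The file holds a hex string such as "0xffff". */
	ok = (fscanf(f, "%x", pkey) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Walking a whole table this way costs one open/read/close per entry, which is consistent with the syscall counts quoted in this thread.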

Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Roland Dreier


 > Maybe give some thought to using a syscall interface through uverbs
 > for some of this?

Actually I think for exposing SL-to-VL and other things like that, sysfs
is pretty good.  Having something usable from both scripts and programs
seems pretty useful, and having an opaque uverbs interface isn't really
an improvement (especially when we have to design something extensible
that device-specific stuff can be put into).

 > IMHO, sysfs is getting out of hand for rdma:

I'm not sure how much of a problem this really is...

 > $ find /sys/class/infiniband/mlx4_0 -type f | wc -l
 > 660

and presumably 512 of those are gid and pkey table entries?

 > $ strace -o /tmp/t /opt/ofa-1.5/sbin/perfquery ; grep sys/ /tmp/t | wc -l
 > 289

That seems a little crazy, but maybe it's an app that's doing silly
stuff?  If I do ibv_rc_pingpong, the only /sys related things I see are:

open("/sys/class/infiniband_verbs/abi_version", O_RDONLY) = 3
open("/sys/class/infiniband_verbs", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
stat("/sys/class/infiniband_verbs/abi_version", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
stat("/sys/class/infiniband_verbs/uverbs0", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
open("/sys/class/infiniband_verbs/uverbs0/ibdev", O_RDONLY) = 4
open("/sys/class/infiniband_verbs/uverbs0/abi_version", O_RDONLY) = 4
open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 3
open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 3
open("/sys/class/infiniband/mlx4_0/node_type", O_RDONLY) = 3

which is reasonable I think.

 - R.

[PATCH] osm_mlid_box: infrastructure for mgid compression

2009-11-11 Thread Sasha Khapyorsky


Now each MLID value is represented by a collection of multicast groups,
the osm_mgrp_box object. All multicast routing calculation and setup
operations are performed using those MLID-indexed objects. Multicast
groups are kept as a linked list in osm_mgrp_box and, as before, are globally
indexed by their MGIDs in the SM DB, so SA operations are unchanged.

This model lets us implement actual MGID to MLID compression by
just placing multiple multicast groups in the list under the same
osm_mgrp_box object.

All current functionality is preserved.

Signed-off-by: Sasha Khapyorsky 
---
 opensm/include/opensm/osm_multicast.h  |   66 ++-
 opensm/include/opensm/osm_subnet.h |   17 ++--
 opensm/opensm/osm_mcast_mgr.c  |  210 
 opensm/opensm/osm_multicast.c  |   57 --
 opensm/opensm/osm_sa.c |   16 +--
 opensm/opensm/osm_sa_mcmember_record.c |   17 +--
 opensm/opensm/osm_subnet.c |   10 +-
 7 files changed, 215 insertions(+), 178 deletions(-)

diff --git a/opensm/include/opensm/osm_multicast.h b/opensm/include/opensm/osm_multicast.h
index 59e4d0d..1da575d 100644
--- a/opensm/include/opensm/osm_multicast.h
+++ b/opensm/include/opensm/osm_multicast.h
@@ -97,8 +97,8 @@ BEGIN_C_DECLS
 */
 typedef struct osm_mgrp {
cl_fmap_item_t map_item;
+   cl_list_item_t list_item;
ib_net16_t mlid;
-   osm_mtree_node_t *p_root;
cl_qmap_t mcm_port_tbl;
ib_member_rec_t mcmember_rec;
boolean_t well_known;
@@ -109,15 +109,13 @@ typedef struct osm_mgrp {
 *  map_item
 *  Map Item for fmap linkage.  Must be first element!!
 *
+*  list_item
+*  List item for linkage in osm_mgrp_box's mgrp_list qlist.
+*
 *  mlid
 *  The network ordered LID of this Multicast Group (must be
 *  >= 0xC000).
 *
-*  p_root
-*  Pointer to the root "tree node" in the single spanning tree
-*  for this multicast group.  The nodes of the tree represent
-*  switches.  Member ports are not represented in the tree.
-*
 *  mcm_port_tbl
 *  Table (sorted by port GUID) of osm_mcm_port_t objects
 *  representing the member ports of this multicast group.
@@ -133,6 +131,37 @@ typedef struct osm_mgrp {
 * SEE ALSO
 */
 
+/****s* OpenSM: Multicast Group/osm_mgrp_box_t
+* NAME
+*  osm_mgrp_box_t
+*
+* DESCRIPTION
+*  Multicast structure which holds all multicast groups with same MLID.
+*
+* SYNOPSIS
+*/
+typedef struct osm_mgrp_box {
+   uint16_t mlid;
+   cl_qlist_t mgrp_list;
+   osm_mtree_node_t *root;
+} osm_mgrp_box_t;
+/*
+* FIELDS
+*  mlid
+*  The host ordered LID of this Multicast Group (must be
+*  >= 0xC000).
+*
+*  root
+*  Pointer to the root "tree node" in the single spanning tree
+*  for this multicast group.  The nodes of the tree represent
+*  switches.  Member ports are not represented in the tree.
+*
+*  mgrp_list
+*  List of multicast groups (mgrp object) having same MLID value.
+*
+* SEE ALSO
+*/
+
 /f* OpenSM: Multicast Group/osm_mgrp_new
 * NAME
 *  osm_mgrp_new
@@ -164,30 +193,6 @@ osm_mgrp_t *osm_mgrp_new(IN osm_subn_t * subn, IN ib_net16_t mlid,
 *  Multicast Group, osm_mgrp_delete
 */
 
-/f* OpenSM: Multicast Group/osm_mgrp_delete
-* NAME
-*  osm_mgrp_delete
-*
-* DESCRIPTION
-*  Destroys and deallocates a Multicast Group.
-*
-* SYNOPSIS
-*/
-void osm_mgrp_delete(IN osm_mgrp_t * p_mgrp);
-/*
-* PARAMETERS
-*  p_mgrp
-*  [in] Pointer to an osm_mgrp_t object.
-*
-* RETURN VALUES
-*  None.
-*
-* NOTES
-*
-* SEE ALSO
-*  Multicast Group, osm_mgrp_new
-*/
-
 /f* OpenSM: Multicast Group/osm_mgrp_is_guid
 * NAME
 *  osm_mgrp_is_guid
@@ -378,6 +383,7 @@ void osm_mgrp_delete_port(IN osm_subn_t * subn, IN osm_log_t * log,
 void osm_mgrp_remove_port(osm_subn_t * subn, osm_log_t * log, osm_mgrp_t * mgrp,
  osm_mcm_port_t * mcm_port, ib_member_rec_t * mcmr);
 void osm_mgrp_cleanup(osm_subn_t * subn, osm_mgrp_t * mpgr);
+void osm_mgrp_box_delete(osm_mgrp_box_t *mbox);
 
 END_C_DECLS
 #endif /* _OSM_MULTICAST_H_ */
diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 0302f91..fc60ced 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -518,7 +518,7 @@ typedef struct osm_subn {
boolean_t coming_out_of_standby;
unsigned need_update;
cl_fmap_t mgrp_mgid_tbl;
-   void *mgroups[IB_LID_MCAST_END_HO - IB_LID_MCAST_START_HO + 1];
+   void *mboxes[IB_LID_MCAST_END_HO - IB_LID_MCAST_START_HO + 1];
 } osm_subn_t;
 /*
 * FIELDS
@@ -643,9 +643,9 @@ typedef struct osm_subn {
 *  Container of pointers to all Multicast group objects in
 *  the subnet
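
The storage model described in this patch (one box per MLID, each box holding a list of the groups that share it) can be sketched without the OpenSM headers. The types and helpers below (`struct mgrp`, `struct mgrp_box`, `mbox_get()`, `mbox_add_group()`) are simplified, hypothetical stand-ins for osm_mgrp_t, osm_mgrp_box_t and the real list handling; only the indexing by `mlid - 0xC000` mirrors the `mboxes[]` array in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MCAST_LID_START 0xC000u /* first multicast LID */
#define MCAST_LID_END   0xFFFEu /* last multicast LID */

struct mgrp {                   /* stand-in for osm_mgrp_t */
	uint8_t mgid[16];
	struct mgrp *next;      /* linkage inside the owning box */
};

struct mgrp_box {               /* stand-in for osm_mgrp_box_t */
	uint16_t mlid;          /* host byte order */
	struct mgrp *groups;    /* all groups sharing this MLID */
	size_t count;
};

/* Table indexed by (mlid - MCAST_LID_START). */
static struct mgrp_box *mboxes[MCAST_LID_END - MCAST_LID_START + 1];

/* Look up the box for an MLID; NULL if no group uses it yet. */
static struct mgrp_box *mbox_get(uint16_t mlid)
{
	assert(mlid >= MCAST_LID_START && mlid <= MCAST_LID_END);
	return mboxes[mlid - MCAST_LID_START];
}

/* Link grp into box and register the box under its MLID. */
static void mbox_add_group(struct mgrp_box *box, struct mgrp *grp)
{
	assert(box->mlid >= MCAST_LID_START && box->mlid <= MCAST_LID_END);
	grp->next = box->groups;
	box->groups = grp;
	box->count++;
	mboxes[box->mlid - MCAST_LID_START] = box;
}
```

Adding two groups with different MGIDs to the same box is all that MGID to MLID compression needs at this level; routing code then works per box rather than per group.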

[PATCH] opensm/osm_mgrp_new(): add subnet db insertion

2009-11-11 Thread Sasha Khapyorsky


Add insertion of mgrp into the subnet DB in the osm_mgrp_new() function.
This consolidation makes the code cleaner and will help us add an MGID to
MLID compression model, where mgrp will not be mapped directly to mlids
but through an additional structure.

Signed-off-by: Sasha Khapyorsky 
---
 opensm/include/opensm/osm_multicast.h  |5 -
 opensm/opensm/osm_multicast.c  |7 ++-
 opensm/opensm/osm_sa_mcmember_record.c |7 +--
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/opensm/include/opensm/osm_multicast.h b/opensm/include/opensm/osm_multicast.h
index f0897f4..59e4d0d 100644
--- a/opensm/include/opensm/osm_multicast.h
+++ b/opensm/include/opensm/osm_multicast.h
@@ -142,9 +142,12 @@ typedef struct osm_mgrp {
 *
 * SYNOPSIS
 */
-osm_mgrp_t *osm_mgrp_new(IN ib_net16_t mlid, IN ib_member_rec_t * mcmr);
+osm_mgrp_t *osm_mgrp_new(IN osm_subn_t * subn, IN ib_net16_t mlid,
+IN ib_member_rec_t * mcmr);
 /*
 * PARAMETERS
+*  subn
+*  [in] Pointer to osm_subn_t object.
 *  mlid
 *  [in] Multicast LID for this multicast group.
 *
diff --git a/opensm/opensm/osm_multicast.c b/opensm/opensm/osm_multicast.c
index 8ccab8e..ff607e1 100644
--- a/opensm/opensm/osm_multicast.c
+++ b/opensm/opensm/osm_multicast.c
@@ -73,7 +73,8 @@ void osm_mgrp_delete(IN osm_mgrp_t * p_mgrp)
free(p_mgrp);
 }
 
-osm_mgrp_t *osm_mgrp_new(IN ib_net16_t mlid, IN ib_member_rec_t * mcmr)
+osm_mgrp_t *osm_mgrp_new(IN osm_subn_t * subn, IN ib_net16_t mlid,
+IN ib_member_rec_t * mcmr)
 {
osm_mgrp_t *p_mgrp;
 
@@ -86,6 +87,10 @@ osm_mgrp_t *osm_mgrp_new(IN ib_net16_t mlid, IN ib_member_rec_t * mcmr)
p_mgrp->mlid = mlid;
p_mgrp->mcmember_rec = *mcmr;
 
+   cl_fmap_insert(&subn->mgrp_mgid_tbl, &p_mgrp->mcmember_rec.mgid,
+  &p_mgrp->map_item);
+   subn->mgroups[cl_ntoh16(p_mgrp->mlid) - IB_LID_MCAST_START_HO] = p_mgrp;
+
return p_mgrp;
 }
 
diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 95c41e4..357e6ab 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -796,7 +796,7 @@ static ib_api_status_t mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
 
/* create a new MC Group */
mcm_rec.mlid = mlid;
-   *pp_mgrp = osm_mgrp_new(mlid, &mcm_rec);
+   *pp_mgrp = osm_mgrp_new(sa->p_subn, mlid, &mcm_rec);
if (*pp_mgrp == NULL) {
OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B08: "
"osm_mgrp_new failed\n");
@@ -813,11 +813,6 @@ static ib_api_status_t mcmr_rcv_create_new_mgrp(IN osm_sa_t * sa,
(*pp_mgrp)->mcmember_rec.pkt_life &= 0x3f;
(*pp_mgrp)->mcmember_rec.pkt_life |= 2 << 6;/* exactly */
 
-   /* Insert the new group in the data base */
-   cl_fmap_insert(&sa->p_subn->mgrp_mgid_tbl,
-  &(*pp_mgrp)->mcmember_rec.mgid, &(*pp_mgrp)->map_item);
-   sa->p_subn->mgroups[cl_ntoh16(mlid) - IB_LID_MCAST_START_HO] = *pp_mgrp;
-
 Exit:
OSM_LOG_EXIT(sa->p_log);
return status;
-- 
1.6.5.2


Re: [ofa-general] [PATCH 1/2 v4] opensm: Storage organization for multicast groups

2009-11-11 Thread Sasha Khapyorsky

Hi Slava,

This patch is outdated (it was already outdated when it was posted to the
list); in particular, I have already removed the need to resolve a multicast
group by mlid for SA PathRecord queries with a multicast destination, among
other things.

On 15:53 Tue 29 Sep , Slava Strebkov wrote:
> Main purpose is to prepare infrastructure for (many) mgids to one mlid
> compression.

When doing the multicast cleanup, I implemented this myself too :).
I didn't post it then due to lack of any testing and switched to
something else. Basically it is very similar (even the structure names), but
with a few differences:

> Proposed the following changes:
> 1.Element in mlid array is now a multicast group box.
> 2.mgrp_box keeps a list of mgroups sharing same mlid.
> With introduction of compression, there will be many
> multicast groups per mlid. Current implementation keeps
> one mgid to one mlid ratio.
> 3.mgrp_box has a map of ports sharing same mlid. Ports sorted
> by port guid. Port map is necessary for building spanning
> tree per mgroup_box, not just for single mgroup.
> 4.Element in port map keeps a list of mgroups opened by this port.
> This allows quick deletion of mgroups when port changes
> state to DOWN.
> 5.Multicast processing functions use mgroup_box object instead
> of mgroup.

I don't have (3) and (4) - a port map per mbox is only useful when
OpenSM calculates multicast routing, so instead of bothering to update
such maps at each MCM Record SA request, I decided to generate a local
map internally when it is needed (in mcast_mgr).

I will post the patch for your review shortly.

> Signed-off-by: Slava Strebkov 

Some comments are below anyway.

> diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
> index 9b72293..6c0a1e6 100644
> --- a/opensm/opensm/osm_qos_policy.c
> +++ b/opensm/opensm/osm_qos_policy.c
> @@ -1,5 +1,5 @@
>  /*
> - * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> + * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
>   * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
> @@ -773,6 +773,8 @@ static void __qos_policy_validate_pkey(
>   uint32_t flow;
>   uint8_t hop;
>   osm_mgrp_t * p_mgrp;
> + osm_mgrp_box_t * p_mgrp_box;
> + cl_list_item_t *p_item;
>  
>   if (!p_qos_policy || !p_qos_match_rule || !p_prtn)
>   return;
> @@ -796,28 +798,33 @@ static void __qos_policy_validate_pkey(
>   if (!p_prtn->mlid)
>   return;
>  
> - p_mgrp = osm_get_mgrp_by_mlid(p_qos_policy->p_subn, p_prtn->mlid);
> - if (!p_mgrp) {
> + p_mgrp_box = osm_get_mgrp_box_by_mlid(p_qos_policy->p_subn, p_prtn->mlid);
> + if (!p_mgrp_box) {
>   OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_ERROR,
> - "ERR AC16: MCast group for partition with "
> + "ERR AC16: MCast group box for partition with "
>   "pkey 0x%04X not found\n",
>   cl_ntoh16(p_prtn->pkey));
>   return;
>   }
>  
> - CL_ASSERT((cl_ntoh16(p_mgrp->mcmember_rec.pkey) & 0x7fff) ==
> -   (cl_ntoh16(p_prtn->pkey) & 0x7fff));
> -
> - ib_member_get_sl_flow_hop(p_mgrp->mcmember_rec.sl_flow_hop,
> -   &sl, &flow, &hop);
> - if (sl != p_prtn->sl) {
> - OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> - "Updating MCGroup (MLID 0x%04x) SL to "
> - "match partition SL (%u)\n",
> - cl_hton16(p_mgrp->mcmember_rec.mlid),
> - p_prtn->sl);
> - p_mgrp->mcmember_rec.sl_flow_hop =
> + p_item = cl_qlist_head(&p_mgrp_box->mgrp_list);
> + while (p_item != cl_qlist_end(&p_mgrp_box->mgrp_list)) {
> + p_mgrp = (osm_mgrp_t *) PARENT_STRUCT(p_item, osm_mgrp_t,
> + box_item);
> + p_item = cl_qlist_next(p_item);
> + CL_ASSERT((cl_ntoh16(p_mgrp->mcmember_rec.pkey) & 0x7fff) ==
> + (cl_ntoh16(p_prtn->pkey) & 0x7fff));
> + ib_member_get_sl_flow_hop(p_mgrp->mcmember_rec.sl_flow_hop,
> + &sl, &flow, &hop);
> + if (sl != p_prtn->sl) {
> + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> + "Updating MCGroup (MLID 0x%04x) SL to "
> + "match partition SL (%u)\n",
> + cl_hton16(p_mgrp->mcmember_rec.mlid),
> + p_prtn->sl);
> + p_mgrp->mcmember_rec.sl_flow_hop =
>   ib_member_set_sl_flow_hop(p_prtn->sl, flow, hop);
> + }

Seems that when QoS requests using certain SL value on some partition
you instead of altering SL value for only mu

Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Jason Gunthorpe

On Wed, Nov 11, 2009 at 03:22:50PM -0800, Ralph Campbell wrote:

> While this is true for SLtoVL, we create other files which are
> device specific under the port directory too.
> It seems like we might need to introduce a callback into the driver to
> create the port specific sysfs files.

Maybe give some thought to using a syscall interface through uverbs
for some of this?

IMHO, sysfs is getting out of hand for rdma:

$ find /sys/class/infiniband/mlx4_0 -type f | wc -l
660
$ strace -o /tmp/t /opt/ofa-1.5/sbin/perfquery ; grep sys/ /tmp/t | wc -l
289

That is a lot of syscalls just to send two SMPs.

It just seems to me there are not that many examples of APIs that
require so much trundling through sysfs to do common every day
application tasks.

Jason

Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Roland Dreier


 > While this is true for SLtoVL, we create other files which are
 > device specific under the port directory too.
 > It seems like we might need to introduce a callback into the driver to
 > create the port specific sysfs files.

Umm, you could have said there were other things initially!

Anyway, rather than a callback, I guess we could just add a place to
attach a set of port attributes to the structure that gets passed into
ib_register_device() maybe?

And maybe we could clean up the existing code that does
device_create_file() to use a list of device attributes also...

 - R.

Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Ralph Campbell

On Wed, 2009-11-11 at 15:02 -0800, Roland Dreier wrote:
>  > | Hmm, maybe we should just add a vls directory with sl0 ... sl15 or
>  > | something like that in generic code?  I don't see why this needs to be
>  > | driver-specific code.
>  > 
>  > No particular reason, it just didn't seem likely to be useful on other
>  > HCA drivers.   I can redo the patches that way, if people think it's
>  > the right thing to do.
> 
> To me it does seem like something generic.  SLtoVL table is required of
> all CAs, so we might as well create it for all IB devices... as I see it
> the advantages of having it core code are:
> 
>  - no need to expose internals of sysfs code port structure to low level
>    drivers (we could also avoid this layering violation by giving a
>    generic way for low-level drivers to add port attributes)
>  - IB-specified info is available for all IB devices with the same
>    format etc.  It may not be important for non-qlogic devices but there
>    is some utility in SL mapping for debugging etc.
> 
> the only disadvantage I see is that it adds the overhead of having those
> sysfs attributes for all systems with an RDMA device, even if the
> qlogic driver is never loaded.  But that overhead is pretty much just a
> small amount of extra code that will never be run and a few sysfs
> structures that will never be touched, so it just takes up a little bit
> of memory.  For RDMA-using systems, I can't imagine it matters.
> 
>  - R.

While this is true for SLtoVL, we create other files which are
device specific under the port directory too.
It seems like we might need to introduce a callback into the driver to
create the port specific sysfs files.


Re: strong ordering for data registered memory

2009-11-11 Thread Jason Gunthorpe

On Wed, Nov 11, 2009 at 05:44:59PM -0500, Richard Frank wrote:

> Would anyone like to throw out the list of HCAs that do this... I
> can guess at a few...  and can ask the vendors directly.. if not.. .
> 
> It would be much nicer to not hardcode names of adapters.. but that won't
> stop us.. :)

Isn't it more complex than this? AFAIK the PCI-E standard does not
specify the order in which data inside a single transfer becomes visible,
only how different transfers relate. To work on the most aggressive
PCI-E system the HCA would have to transfer the last XX bytes as a
separate PCI-E transaction without relaxed ordering.

This is the sort of thing that might start to matter on QPI and HT
memory-interleaved configurations. A multi-cache line transfer will be
split up and completed on different chips - it may not be fully
coherent 100% of the time.

Jason

Re: strong ordering for data registered memory

2009-11-11 Thread Roland Dreier


 > I decided to minimize the impact of an API change on the class of
 > applications that use the current verbs interface because those
 > applications can safely run on platforms that deliver optimal
 > performance using weak ordering for data buffers.  New binaries aren't
 > required for this class of application.
 > 
 > I thought it would be more appropriate to put the burden of added
 > complexity on the class of applications that bypass the verbs to
 > access special features in the hardware.  In fact, those applications
 > are selective about memory regions that need this special handling and
 > would register lots of memory without the "strong ordering" bit.  How
 > applications determine that the platform is capable of performing the
 > request would be beyond the scope of the verbs; however, I suppose
 > that the verbs framework could check and return an error.
 > 
 > If there are applications that expect the hardware to support "strong
 > ordering" and don't check the hardware, then these might be a problem.
 > Do any of these exist?
 > 
 > By the way, if I had proposed this bit several years ago, then I would
 > have chosen a "weak ordering" flag.  Instead, I decided to try
 > protecting the existing base of verbs-based software.

I can't really follow this.  Right now Open MPI et al assume that if
they see a Mellanox adapter, they get the "last byte of RDMA becomes
visible last" behavior.  And there is not a way that I know of to turn
this off at all, let alone get any performance difference.  The
exception being the Cell processor system that started the previous
discussion, where weak ordering at the platform level helped things.

But given that current software does seem to rely on ordering, it seems
that opting into weak ordering would break fewer applications.

 - R.
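
To make the assumption concrete, here is a minimal sketch of the kind of receive-side polling that relies on "last byte of RDMA becomes visible last". It is not Open MPI's code; struct eager_slot, its 63+1 byte layout and eager_slot_poll() are made up for illustration, and the fence uses the GCC/Clang __atomic builtin.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_PAYLOAD 63

/* Hypothetical receive slot: payload first, one-byte flag as the very last byte. */
struct eager_slot {
	uint8_t payload[SLOT_PAYLOAD];
	volatile uint8_t ready; /* set by the final byte of the remote RDMA Write */
};

/*
 * Returns 1 and copies the payload if a message has landed, 0 if the slot is
 * still empty, -1 on a bad length.  Only correct if the adapter and platform
 * make the last byte visible after all earlier bytes.
 */
int eager_slot_poll(struct eager_slot *slot, uint8_t *out, size_t len)
{
	if (len > SLOT_PAYLOAD)
		return -1;
	if (!slot->ready)
		return 0;
	/* Keep the CPU from reading the payload ahead of the flag check. */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	memcpy(out, slot->payload, len);
	slot->ready = 0; /* hand the slot back for reuse */
	return 1;
}
```

On a platform without that ordering guarantee the flag could be seen before the payload is complete, which is why software like this would break if ordering were weakened by default.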

Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Roland Dreier


 > | Hmm, maybe we should just add a vls directory with sl0 ... sl15 or
 > | something like that in generic code?  I don't see why this needs to be
 > | driver-specific code.
 > 
 > No particular reason, it just didn't seem likely to be useful on other
 > HCA drivers.   I can redo the patches that way, if people think it's
 > the right thing to do.

To me it does seem like something generic.  SLtoVL table is required of
all CAs, so we might as well create it for all IB devices... as I see it
the advantages of having it in core code are:

 - no need to expose internals of sysfs code port structure to low level
   drivers (we could also avoid this layering violation by giving a
   generic way for low-level drivers to add port attributes)
 - IB-specified info is available for all IB devices with the same
   format etc.  It may not be important for non-qlogic devices but there
   is some utility in SL mapping for debugging etc.

the only disadvantage I see is that it adds the overhead of having those
sysfs attributes for all systems with an RDMA device, even if the
qlogic driver is never loaded.  But that overhead is pretty much just a
small amount of extra code that will never be run and a few sysfs
structures that will never be touched, so it just takes up a little bit
of memory.  For RDMA-using systems, I can't imagine it matters.

 - R.
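
A small sketch of what generic code would need per port: look up the VL for each of the 16 SLs and expose one file per SL. The nibble packing (even SL in the high nibble) is my assumption about the table layout, and the "vls/slN" path is only a guess at the suggested directory; sl_to_vl() and sl_attr_path() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define IB_NUM_SL 16

/* Assumed packing: two 4-bit VLs per byte, even SL in the high nibble. */
int sl_to_vl(const uint8_t table[IB_NUM_SL / 2], unsigned sl)
{
	uint8_t byte;

	if (sl >= IB_NUM_SL)
		return -1;
	byte = table[sl / 2];
	return (sl % 2) ? (byte & 0x0F) : (byte >> 4);
}

/* Build the hypothetical sysfs path for one SL entry of one port. */
int sl_attr_path(char *buf, size_t len, const char *dev, unsigned port,
		 unsigned sl)
{
	return snprintf(buf, len, "/sys/class/infiniband/%s/ports/%u/vls/sl%u",
			dev, port, sl);
}
```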

Re: strong ordering for data registered memory

2009-11-11 Thread Richard Frank

Would anyone like to throw out the list of HCAs that do this... I can
guess at a few... and can ask the vendors directly.. if not.. .

It would be much nicer to not hardcode names of adapters.. but that won't
stop us.. :)

David Brean wrote:
Yes, there are HCAs that provide strong ordering.  And an application
such as OpenMPI checks the HCA model and if appropriate enables a
mechanism called "eager RDMA" that depends on it.

-David

Richard Frank wrote:
Today apps are forced to assume that all transports cannot provide
strong ordering.. and hence must implement solutions to work around this.

There are specific optimizations an app might make if it knows the
underpinning transport can make these guarantees..

It would be useful if "strong ordering" were exposed as an attribute of
a transport.. and as well - have the ability to provide a hint to enable
"strong ordering" on either registration or per operation or at the qp
level.

Are there any HCAs that provide "strong ordering" today?

Roland Dreier wrote:

 > Some time ago there was an email sent to this group with the subject
 > "weak ordering for data registered memory".  I don't recall any action
 > resulting from this thread.  So, I have a question.  If a bit were
 > defined to specify "strong ordering", perhaps as a "access" flag (see
 > ibv_access_flags) and used with ibv_reg_mr(), would that be sufficient
 > for (1) client applications that need a HW "guarantee" of writing the
 > last byte of an RDMA last and (2) platform implementations that need
 > to deliver that feature?

What would happen if an application asked for strong ordering and the
adapter and/or platform is not capable of that?

Weak ordering is a bit easier to handle -- the app is saying "if you can
make things go faster, don't worry about ordering here" and a platform
where it doesn't matter can just ignore it.

 - R.

Re: strong ordering for data registered memory

2009-11-11 Thread David Brean

Yes, there are HCAs that provide strong ordering.  And an application 
such as OpenMPI checks the HCA model and if appropriate enables a 
mechanism called "eager RDMA" that depends on it.

-David

Richard Frank wrote:
Today apps are forced to assume that all transports cannot provide
strong ordering.. and hence must implement solutions to work around this.

There are specific optimizations an app might make if it knows the
underpinning transport can make these guarantees..

It would be useful if "strong ordering" were exposed as an attribute of
a transport.. and as well - have the ability to provide a hint to enable
"strong ordering" on either registration or per operation or at the qp
level.

Are there any HCAs that provide "strong ordering" today?

Roland Dreier wrote:

 > Some time ago there was an email sent to this group with the subject
 > "weak ordering for data registered memory".  I don't recall any action
 > resulting from this thread.  So, I have a question.  If a bit were
 > defined to specify "strong ordering", perhaps as a "access" flag (see
 > ibv_access_flags) and used with ibv_reg_mr(), would that be sufficient
 > for (1) client applications that need a HW "guarantee" of writing the
 > last byte of an RDMA last and (2) platform implementations that need
 > to deliver that feature?

What would happen if an application asked for strong ordering and the
adapter and/or platform is not capable of that?

Weak ordering is a bit easier to handle -- the app is saying "if you can
make things go faster, don't worry about ordering here" and a platform
where it doesn't matter can just ignore it.

 - R.

Re: strong ordering for data registered memory

2009-11-11 Thread David Brean

I decided to minimize the impact of an API change on the class of 
applications that use the current verbs interface because those 
applications can safely run on platforms that deliver optimal 
performance using weak ordering for data buffers.  New binaries aren't 
required for this class of application.

I thought it would be more appropriate to put the burden of added 
complexity on the class of applications that bypass the verbs to access 
special features in the hardware.  In fact, those applications are 
selective about memory regions that need this special handling and would 
register lots of memory without the "strong ordering" bit.  How
applications determine that the platform is capable of performing the
request would be beyond the scope of the verbs; however, I suppose that
the verbs framework could check and return an error.

If there are applications that expect the hardware to support "strong
ordering" and don't check the hardware, then these might be a problem.
Do any of these exist?

By the way, if I had proposed this bit several years ago, then I would 
have chosen a "weak ordering" flag.  Instead, I decided to try 
protecting the existing base of verbs-based software.

-David

Roland Dreier wrote:

 > Some time ago there was an email sent to this group with the subject
 > "weak ordering for data registered memory".  I don't recall any action
 > resulting from this thread.  So, I have a question.  If a bit were
 > defined to specify "strong ordering", perhaps as a "access" flag (see
 > ibv_access_flags) and used with ibv_reg_mr(), would that be sufficient
 > for (1) client applications that need a HW "guarantee" of writing the
 > last byte of an RDMA last and (2) platform implementations that need
 > to deliver that feature?

What would happen if an application asked for strong ordering and the
adapter and/or platform is not capable of that?

Weak ordering is a bit easier to handle -- the app is saying "if you can
make things go faster, don't worry about ordering here" and a platform
where it doesn't matter can just ignore it.

 - R.

Re: ipath now and then (was [PATCH] IB/core: export struct ib_port)

2009-11-11 Thread Ralph Campbell

On Wed, 2009-11-11 at 13:18 -0800, Or Gerlitz wrote:
> On Wed, Nov 11, 2009 at 11:06 PM, Dave Olson  wrote:
> > And yes, the ib_ipath is being fully deprecated.  The "full set" of
> > patches that adds ib_qib upstream will include a subset that drops
> > ib_ipath.   All the bug fixes and feature work have been done for ib_qib
> 
> It was brought up on a few occasions that the ipath driver can be
> changed such that it becomes a software IBoE driver (e.g. use a packet
> socket with the IBoE ether type for the IB L2 emulation).
> If it doesn't have to serve the qlogic HCA anymore, this
> transformation might be even easier.
> I wonder if it's better to remove it now and maybe return it later with
> the new facelift or leave it till the change is done.
> 
> Or.

I don't understand what you are suggesting.
The kernel module name ib_ipath and/or directory name
drivers/infiniband/hw/ipath could be reused for some
other purpose certainly.



Re: ipath now and then (was [PATCH] IB/core: export struct ib_port)

2009-11-11 Thread Or Gerlitz

On Wed, Nov 11, 2009 at 11:06 PM, Dave Olson  wrote:
> And yes, the ib_ipath is being fully deprecated.  The "full set" of
> patches that adds ib_qib upstream will include a subset that drops
> ib_ipath.   All the bug fixes and feature work have been done for ib_qib

It was brought up on a few occasions that the ipath driver can be
changed such that it becomes a software IBoE driver (e.g. use a packet
socket with the IBoE ether type for the IB L2 emulation).
If it doesn't have to serve the qlogic HCA anymore, this
transformation might be even easier.
I wonder if it's better to remove it now and maybe return it later with
the new facelift or leave it till the change is done.

Or.

Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Dave Olson

On Wed, 11 Nov 2009, Roland Dreier wrote:
|  > It is used by the new ib_qib driver to expose the SL to VL table
|  > since the user level MPI library (libpsm) constructs packets including
|  > the IB header. After the driver calls ib_register_device(),
|  > it calls device_create_file() to create the files in
|  > /sys/class/infiniband/qib0/. Then it uses struct ib_device->port_list
|  > to get the pointer to ib_port to add a directory similar to "gids"
|  > and "pkeys" for each SL.
| 
| Hmm, maybe we should just add a vls directory with sl0 ... sl15 or
| something like that in generic code?  I don't see why this needs to be
| driver-specific code.

No particular reason, it just didn't seem likely to be useful on other
HCA drivers.   I can redo the patches that way, if people think it's
the right thing to do.

|  > Yes, this is what I'm working on.
|  > The patch is the only change outside of the hw/qib/ directory.
|  > Do you want to see a preview of the sysfs code?
| 
| I think the earlier you can post the whole driver, the sooner you'll get
| it upstream.

Yeah, we know...   Schedule and people conflicts have slowed this down.
We want to do it, and soon.

And yes, the ib_ipath is being fully deprecated.  The "full set" of
patches that adds ib_qib upstream will include a subset that drops
ib_ipath.   All the bug fixes and feature work have been done for ib_qib

Dave Olson
dave.ol...@qlogic.com

RE: [PATCH] RDMA/addr: Use appropriate locking with for_each_netdev()

2009-11-11 Thread Sean Hefty

>Would it be possible for you to take Eric's patch as the first in your
>set (keeping his From: of course) and base your fixes on top of that?

Will do.


Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Roland Dreier


 > It is used by the new ib_qib driver to expose the SL to VL table
 > since the user level MPI library (libpsm) constructs packets including
 > the IB header. After the driver calls ib_register_device(),
 > it calls device_create_file() to create the files in
 > /sys/class/infiniband/qib0/. Then it uses struct ib_device->port_list
 > to get the pointer to ib_port to add a directory similar to "gids"
 > and "pkeys" for each SL.

Hmm, maybe we should just add a vls directory with sl0 ... sl15 or
something like that in generic code?  I don't see why this needs to be
driver-specific code.

 > Yes, this is what I'm working on.
 > The patch is the only change outside of the hw/qib/ directory.
 > Do you want to see a preview of the sysfs code?

I think the earlier you can post the whole driver, the sooner you'll get
it upstream.

 - R.

Re: [PATCH] opensm/osm_mcast_mgr.c: fix return value on alloc_mfts() failures

2009-11-11 Thread Hal Rosenstock

On Wed, Nov 11, 2009 at 1:33 PM, Sasha Khapyorsky  wrote:
>
> When alloc_mfts() fails and the multicast routing calculation is
> interrupted, return -1 to the caller.

Yes, that fixes the return value but nothing (at least currently)
takes advantage of that. Is that the next step ?

-- Hal

>
> Signed-off-by: Sasha Khapyorsky 
> ---
>  opensm/opensm/osm_mcast_mgr.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
> index 105e905..7bd7add 100644
> --- a/opensm/opensm/osm_mcast_mgr.c
> +++ b/opensm/opensm/osm_mcast_mgr.c
> @@ -1066,6 +1066,7 @@ int osm_mcast_mgr_process(osm_sm_t * sm)
>        if (alloc_mfts(sm)) {
>                OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>                        "ERR 0A07: alloc_mfts failed\n");
> +               ret = -1;
>                goto exit;
>        }
>
> @@ -1110,6 +,7 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
>        if (alloc_mfts(sm)) {
>                OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>                        "ERR 0A09: alloc_mfts failed\n");
> +               ret = -1;
>                goto exit;
>        }
>
> --
> 1.6.5.2
>
>

Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Ralph Campbell

On Wed, 2009-11-11 at 11:19 -0800, Roland Dreier wrote:
> > This patch moves the definition of struct ib_port from
>  > sysfs.c to ib_verbs.h so that HCAs can create files in
>  > /sys/class/infiniband//ports//
> 
> um, maybe, but we need to see how it gets used first.  How do you get
> to the struct ib_port in driver code?  Maybe it would make more sense to
> add a way for low-level drivers to pass in port attributes that get
> added when the port structure gets created?

It is used by the new ib_qib driver to expose the SL to VL table
since the user level MPI library (libpsm) constructs packets including
the IB header. After the driver calls ib_register_device(),
it calls device_create_file() to create the files in
/sys/class/infiniband/qib0/. Then it uses struct ib_device->port_list
to get the pointer to ib_port to add a directory similar to "gids"
and "pkeys" for each SL.

> By the way, any plans to resume working on the upstream driver for
> qlogic HCAs?  Do you still plan to deprecate ib_ipath and add a new
> driver for new devices?

Yes, this is what I'm working on.
The patch is the only change outside of the hw/qib/ directory.
Do you want to see a preview of the sysfs code?


Re: [PATCH] IB/core: export struct ib_port

2009-11-11 Thread Roland Dreier


 > This patch moves the definition of struct ib_port from
 > sysfs.c to ib_verbs.h so that HCAs can create files in
 > /sys/class/infiniband//ports//

um, maybe, but we need to see how it gets used first.  How do you get
to the struct ib_port in driver code?  Maybe it would make more sense to
add a way for low-level drivers to pass in port attributes that get
added when the port structure gets created?

By the way, any plans to resume working on the upstream driver for
qlogic HCAs?  Do you still plan to deprecate ib_ipath and add a new
driver for new devices?

 - R.

[PATCH] IB/core: export struct ib_port

2009-11-11 Thread Ralph Campbell

This patch moves the definition of struct ib_port from
sysfs.c to ib_verbs.h so that HCAs can create files in
/sys/class/infiniband//ports//

Signed-off-by: Ralph Campbell 
---

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 158a214..e01f3e7 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -39,14 +39,6 @@
 
 #include 
 
-struct ib_port {
-   struct kobject kobj;
-   struct ib_device  *ibdev;
-   struct attribute_group gid_group;
-   struct attribute_group pkey_group;
-   u8 port_num;
-};
-
 struct port_attribute {
struct attribute attr;
ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c179318..5d23957 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1154,6 +1154,14 @@ struct ib_device {
u8   phys_port_cnt;
 };
 
+struct ib_port {
+   struct kobject kobj;
+   struct ib_device  *ibdev;
+   struct attribute_group gid_group;
+   struct attribute_group pkey_group;
+   u8 port_num;
+};
+
 struct ib_client {
char  *name;
void (*add)   (struct ib_device *);



[PATCH] opensm/osm_mcast_mgr.c: fix return value on alloc_mfts() failures

2009-11-11 Thread Sasha Khapyorsky


When alloc_mfts() fails and the multicast routing calculation is
interrupted, return -1 to the caller.

Signed-off-by: Sasha Khapyorsky 
---
 opensm/opensm/osm_mcast_mgr.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c
index 105e905..7bd7add 100644
--- a/opensm/opensm/osm_mcast_mgr.c
+++ b/opensm/opensm/osm_mcast_mgr.c
@@ -1066,6 +1066,7 @@ int osm_mcast_mgr_process(osm_sm_t * sm)
 	if (alloc_mfts(sm)) {
 		OSM_LOG(sm->p_log, OSM_LOG_ERROR,
 			"ERR 0A07: alloc_mfts failed\n");
+		ret = -1;
 		goto exit;
 	}
 
@@ -1110,6 +,7 @@ int osm_mcast_mgr_process_mgroups(osm_sm_t * sm)
 	if (alloc_mfts(sm)) {
 		OSM_LOG(sm->p_log, OSM_LOG_ERROR,
 			"ERR 0A09: alloc_mfts failed\n");
+		ret = -1;
 		goto exit;
 	}
 
-- 
1.6.5.2
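
The pattern the patch repairs can be shown with a stand-alone sketch. This is not OpenSM code: alloc_tables_model() and process_model() are hypothetical stand-ins for alloc_mfts() and the two patched functions, kept only to show why the assignment before "goto exit" matters.

```c
#include <stdlib.h>

/* Hypothetical stand-in for alloc_mfts(): nonzero means allocation failed. */
int alloc_tables_model(int simulate_failure, void **out)
{
	if (simulate_failure)
		return -1;
	*out = malloc(16);
	return *out ? 0 : -1;
}

/* Same shape as the patched functions: one exit label, ret set on failure. */
int process_model(int simulate_failure)
{
	void *tables = NULL;
	int ret = 0;

	if (alloc_tables_model(simulate_failure, &tables)) {
		ret = -1; /* without this line the caller would see success */
		goto exit;
	}

	/* ... routing calculation would go here ... */

exit:
	free(tables); /* free(NULL) is a no-op */
	return ret;
}
```

The value only helps once callers check it, which the patch itself does not change.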


Re: [PATCH] RDMA/addr: Use appropriate locking with for_each_netdev()

2009-11-11 Thread Roland Dreier


 > >for_each_netdev() should be used with RTNL or dev_base_lock held,
 > >or risk a crash.
 > >
 > >Signed-off-by: Eric Dumazet 
 > 
 > Thanks - I'm working on a patch set in this area.  Roland, I can merge Eric's
 > changes into that patch set if it makes things easier.

Would it be possible for you to take Eric's patch as the first in your
set (keeping his From: of course) and base your fixes on top of that?
That seems the cleanest thing to me.

Thanks,
  Roland

RE: [PATCH] RDMA/addr: Use appropriate locking with for_each_netdev()

2009-11-11 Thread Sean Hefty

>for_each_netdev() should be used with RTNL or dev_base_lock held,
>or risk a crash.
>
>Signed-off-by: Eric Dumazet 

Thanks - I'm working on a patch set in this area.  Roland, I can merge Eric's
changes into that patch set if it makes things easier.

> drivers/infiniband/core/addr.c |9 +++--
> 1 files changed, 7 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
>index bd07803..5ca0b2c 100644
>--- a/drivers/infiniband/core/addr.c
>+++ b/drivers/infiniband/core/addr.c
>@@ -131,6 +131,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct

The changes to this function are still valid.

>@@ -391,15 +393,17 @@ static int addr_resolve_local(struct sockaddr *src_in,

This function is going away.

- Sean
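
As a userspace analogy only (this is not the kernel patch): the rule is that the shared device list may be walked only while the lock guarding it is held. Below, a pthread mutex plays the role of RTNL; netdev_model, netdev_model_add() and netdev_model_exists() are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: a device list and the lock that guards it. */
struct netdev_model {
	const char *name;
	struct netdev_model *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER; /* plays RTNL */
static struct netdev_model *dev_list;

void netdev_model_add(struct netdev_model *dev)
{
	pthread_mutex_lock(&list_lock);
	dev->next = dev_list;
	dev_list = dev;
	pthread_mutex_unlock(&list_lock);
}

/* Walk the list only while holding the lock, so no entry can vanish mid-walk. */
int netdev_model_exists(const char *name)
{
	struct netdev_model *dev;
	int found = 0;

	pthread_mutex_lock(&list_lock);
	for (dev = dev_list; dev; dev = dev->next) {
		if (strcmp(dev->name, name) == 0) {
			found = 1;
			break; /* leave the loop, then unlock on the common path */
		}
	}
	pthread_mutex_unlock(&list_lock);
	return found;
}
```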


Re: strong ordering for data registered memory

2009-11-11 Thread Richard Frank

Today apps are forced to assume that all transports cannot provide
strong ordering.. and hence must implement solutions to work around this.

There are specific optimizations an app might make if it knows the
underpinning transport can make these guarantees..

It would be useful if "strong ordering" were exposed as an attribute of
a transport.. and as well - have the ability to provide a hint to enable
"strong ordering" on either registration or per operation or at the qp
level.

Are there any HCAs that provide "strong ordering" today?

Roland Dreier wrote:

 > Some time ago there was an email sent to this group with the subject
 > "weak ordering for data registered memory".  I don't recall any action
 > resulting from this thread.  So, I have a question.  If a bit were
 > defined to specify "strong ordering", perhaps as a "access" flag (see
 > ibv_access_flags) and used with ibv_reg_mr(), would that be sufficient
 > for (1) client applications that need a HW "guarantee" of writing the
 > last byte of an RDMA last and (2) platform implementations that need
 > to deliver that feature?

What would happen if an application asked for strong ordering and the
adapter and/or platform is not capable of that?

Weak ordering is a bit easier to handle -- the app is saying "if you can
make things go faster, don't worry about ordering here" and a platform
where it doesn't matter can just ignore it.

 - R.

Re: back to back RDMA read fail?

2009-11-11 Thread Dotan Barak


I have 2 questions:
1) if you change the opcode to RDMA Write, do you still experience this
   problem?
   (assuming that the permissions allow RDMA Write; if not, fix this issue)

2) what are the values of the outstanding RDMA Read/Atomic in both
   QPs (as initiator and as target)?


Dotan

neutron wrote:

On Wed, Nov 11, 2009 at 4:52 AM, Dotan Barak  wrote:
  

Hi.

how do you connect the QPs?
via CM/CMA or by sockets (and you actually call the ibv_modify_qp)?




I exchange the initial QP information (lid, qpn, psn) via sockets.  No
CM is used. I manually take care of everything.

Thanks!

  

Dotan

neutron wrote:


Hi Paul, thanks a lot for your quick reply!

In my test,  client informs the server of its local memory (rkey,
addr, size) by sending 4 back to back messages,  each message elicits
a RDMA read request (RR) from the server.

In other words, client exposes its memory to the server, and server
RDMA reads it.

As far as RDMA read is concerned, server is a requester, and client is
a responder, right?

The error I encountered happens at the initial phase, when the client
sends 4 back to back messages to the server (using ibv_post_send()),
containing the (rkey, addr, size) of the client's local memory.

In these 4 ibv_post_send() calls, the client will see one failure.  At the
server side, the server has already posted enough WRs in the RQ.  The
failures are included in my first email.

Looking at the program output, it appears that the server gets message
1, issues RR 1, gets message 2, issues RR 2.  But somehow the client
reports that "send message 2" fails.

On the contrary, the server reports that "receive message 3" fails.

As a result, the server gets messages 1,2,4, and succeeds with RR 1,2,4.
But the client sees that message 2 fails, and succeeds with messages 1,3,4.
This inconsistency is the problem that puzzles me.


By the way, how should I interpret the parameters for RDMA, and which
parameters control RDMA behavior?  Below are the ones I could find; there
must be more:

  max_qp_rd_atom: 4
  max_res_rd_atom:258048
  max_qp_init_rd_atom:128

  qp_attr.max_dest_rd_atomic
  qp_attr.max_rd_atomic



-neutron



On Tue, Nov 10, 2009 at 2:04 AM, Paul Grun 
wrote:

  

Is it possible that you exceeded the number of RDMA Read Resources
available on the server?  There is an expectation that the client knows
how many outstanding RDMA Read Requests the responder (server) is capable
of handling; if the requester (client) exceeds that number, the responder
will indeed return a NAK-Invalid Request.  Sounds like your server is
configured to accept three outstanding RDMA Read Requests.
This also explains why it works when you pause the program
periodically... it gives the responder time to generate the RDMA Read
Responses and therefore free up some resources to be used in receiving
the next incoming RDMA Read Request.

-Paul
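
A minimal sketch of the bookkeeping described above, assuming the application itself limits how many RDMA Reads it has outstanding on a QP. struct rd_throttle and its three functions are hypothetical; the two init arguments correspond to the requester's max_rd_atomic and the responder's max_dest_rd_atomic QP attributes, which the two sides would have to exchange themselves when no CM is used.

```c
#include <stdint.h>

/* Hypothetical requester-side limiter for outstanding RDMA Reads on one QP. */
struct rd_throttle {
	uint8_t limit;     /* reads the responder agreed to handle at once */
	uint8_t in_flight; /* reads posted but not yet completed */
};

/* Usable depth: the smaller of what we may initiate and what the peer accepts. */
void rd_throttle_init(struct rd_throttle *t, uint8_t local_max_rd_atomic,
		      uint8_t remote_max_dest_rd_atomic)
{
	t->limit = local_max_rd_atomic < remote_max_dest_rd_atomic ?
		   local_max_rd_atomic : remote_max_dest_rd_atomic;
	t->in_flight = 0;
}

/* Call before posting an RDMA Read; returns 0 if the read must wait. */
int rd_throttle_try_acquire(struct rd_throttle *t)
{
	if (t->in_flight >= t->limit)
		return 0;
	t->in_flight++;
	return 1;
}

/* Call when the RDMA Read's completion is polled from the CQ. */
void rd_throttle_release(struct rd_throttle *t)
{
	if (t->in_flight > 0)
		t->in_flight--;
}
```

Whether the adapter already enforces this limit on its own is not something this thread establishes, so the sketch only shows the accounting.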

-Original Message-
From: linux-rdma-ow...@vger.kernel.org
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of neutron
Sent: Monday, November 09, 2009 9:04 PM
To: linux-rdma@vger.kernel.org
Subject: back to back RDMA read fail?

Hi all,

I have a simple program that tests back to back RDMA read performance.
However I encountered errors for unknown reasons.

The basic flow of my program is:

client:
ibv_post_send() to send 4 back to back messages to server (no delay
in between). Each message contains the (rkey, addr, size) of a local
buffer. The buffer is registered with remote read/write permissions.
After that, ibv_poll_cq() is called to wait for completion.

server:
First, enough receive WRs are posted to the RQ.  Upon receipt of each
message, immediately post a RDMA read request, using the (rkey, addr,
size) information contained in the originating message.

--
Both client and server use RC QP.  Some errors are observed.

On client side,  ibv_poll_cq() gets 4 CQE, one out of the 4 CQE is an
error:
CQ::  wr_id=0x0, wc_opcode=IBV_WC_SEND, wc_status=remote invalid RD
request, wc_flag=0x3b
byte_len=11338758, immdata=1110104528, qp_num=0x0, src_qp=2290530758

The other 3 CQE are success.

On server side,
3 of the 4 messages are successfully received. One message produces an
error CQE:
CQ::  wr_id=0x80, wc_opcode=Unknow-wc-opcode,
wc_status=unknown, wc_flag=0x0
byte_len=9569287, immdata=0, qp_num=0x0, src_qp=265551872

The 3 RDMA read corresponding to the successful receive all succeed.

But, if I pause the client program for a short while (usleep(100) for
example) after calling ibv_post_send(), then no error occurs.
Can anyone point out the pitfall here? Thanks!


---
On both client and server, I'm using  'mthca0' type MT25208.  The QPs
are initialized with "qp_attr.max_dest_rd_atomic=4,
qp_attr.max_rd_atomic = 4".  The QP's "devinfo -v" gives the
information:

hca_id: mthca0
  fw_ver: 5.1.400
  node_guid:  0002:c902:0023:c04c
  sys_image_guid:

Re: strong ordering for data registered memory

2009-11-11 Thread Roland Dreier


 > Some time ago there was an email sent to this group with the subject
 > "weak ordering for data registered memory".  I don't recall any action
 > resulting from this thread.  So, I have a question.  If a bit were
 > defined to specify "strong ordering", perhaps as a "access" flag (see
 > ibv_access_flags) and used with ibv_reg_mr(), would that be sufficient
 > for (1) client applications that need a HW "guarantee" of writing the
 > last byte of an RDMA last and (2) platform implementations that need
 > to deliver that feature?

What would happen if an application asked for strong ordering and the
adapter and/or platform is not capable of that?

Weak ordering is a bit easier to handle -- the app is saying "if you can
make things go faster, don't worry about ordering here" and a platform
where it doesn't matter can just ignore it.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Patch] Init ipoib_neigh.dgid

2009-11-11 Thread David J. Wilder

Ipoib can miss a change in dgid under some conditions.  The problem is
caused when ipoib_neigh->dgid contains a stale address.  The fix is to
set ipoib_neigh->dgid to zero in ipoib_neigh_alloc().

Detailed description: A system using bonding on its ipoib interface has
switched its active slave interface from interface A to B and back to A,
setting up the situation for this bug.  The system that fails will not
correctly process the 2nd address change.

Each neighbor has an associated ipoib_neigh.  ipoib_neigh->dgid holds a
copy of the remote node's hardware address.  When an address changes,
neighbor->ha is updated by the network layer (arp code) with the new
address.  Ipoib detects this change in ipoib_start_xmit() by comparing
neighbor->ha with ipoib_neigh->dgid.  The bug is that ipoib_neigh->dgid
already contains the new address (A), thus the change from B to A is
missed by ipoib.  Here is the sequence of events:

ipoib_neigh->dgid = A neighbor->ha=A

The address is switched to B (the first switch)

neighbor->ha=B

The change is seen in ipoib_start_xmit(). neighbor->ha !=
ipoib_neigh->dgid

The ipoib_neigh is released, and a new one is allocated.

The memory allocation system returned the same chunk of memory that was
just released, therefore ipoib_neigh->dgid still contains A at this point.

ipoib_neigh->dgid should be updated in neigh_add_path(), but if the
following conditions are true, dgid is not updated:

1) __path_find() returns a path

2) path->ah is NULL

The remote system now switches from address B to A, neighbor->ha is
updated to A.

Now we have: ipoib_neigh->dgid = A neighbor->ha=A

Since the addresses are the same, ipoib won't process the change in address.
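The sequence of events can be replayed in a small user-space model.  This
is only a sketch under assumptions, not the kernel code: struct fake_neigh,
fake_alloc() and the one-slot allocator are hypothetical stand-ins for
ipoib_neigh, ipoib_neigh_alloc() and an allocator that hands back the chunk
that was just released.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define GID_LEN 16

/* Hypothetical stand-in for struct ipoib_neigh; only dgid matters here. */
struct fake_neigh {
	unsigned char dgid[GID_LEN];
};

/* One-slot "allocator": always returns the same chunk of memory. */
static struct fake_neigh slot;

struct fake_neigh *fake_alloc(bool zero_dgid)
{
	if (zero_dgid)	/* models the memset() added by the patch */
		memset(slot.dgid, 0, sizeof(slot.dgid));
	return &slot;
}

/* Models the comparison done in ipoib_start_xmit(). */
bool change_seen(const struct fake_neigh *n, const unsigned char *ha)
{
	return memcmp(n->dgid, ha, GID_LEN) != 0;
}

/* Replays A -> B -> A; returns true if the second change is noticed. */
bool second_change_noticed(bool zero_dgid)
{
	unsigned char a[GID_LEN], b[GID_LEN];
	struct fake_neigh *n;

	memset(a, 0xaa, sizeof(a));
	memset(b, 0xbb, sizeof(b));

	n = fake_alloc(true);
	memcpy(n->dgid, a, GID_LEN);	/* dgid = A, ha = A */

	/* ha becomes B: the first switch is always seen */
	assert(change_seen(n, b));

	/* neigh is released and reallocated from the same memory; dgid is
	 * not refreshed (path found, path->ah == NULL) */
	n = fake_alloc(zero_dgid);

	/* ha goes back to A */
	return change_seen(n, a);
}
```

Calling second_change_noticed(false) models the current code and misses
the change back to A; second_change_noticed(true) models the patched
allocation and notices it.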

Signed-off-by: David Wilder 

--
 drivers/infiniband/ulp/ipoib/ipoib_main.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 2bf5116..25ef50b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -884,6 +884,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour,
 
 	neigh->neighbour = neighbour;
 	neigh->dev = dev;
+	memset(&neigh->dgid.raw, 0, sizeof(union ib_gid));
 	*to_ipoib_neigh(neighbour) = neigh;
 	skb_queue_head_init(&neigh->queue);
 	ipoib_cm_set(neigh, NULL);



RE: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM

2009-11-11 Thread Sean Hefty

>Are Jason's patches a superset of David's patches? or they need to be
>applied and only then David's work can be re-reviewed/merged, etc?

I believe that Jason's patches are a superset of David's.


Re: back to back RDMA read fail?

2009-11-11 Thread neutron

On Wed, Nov 11, 2009 at 4:52 AM, Dotan Barak  wrote:
> Hi.
>
> how do you connect the QPs?
> via CM/CMA or by sockets (and you actually call the ibv_modify_qp)?
>

I exchange the initial QP information (lid, qpn, psn) via sockets.  No
CM is used. I manually take care of everything.

Thanks!

> Dotan
>
> neutron wrote:
>>
>> Hi Paul, thanks a lot for your quick reply!
>>
>> In my test,  client informs the server of its local memory (rkey,
>> addr, size) by sending 4 back to back messages,  each message elicits
>> a RDMA read request (RR) from the server.
>>
>> In other words, client exposes its memory to the server, and server
>> RDMA reads it.
>>
>> As far as RDMA read is concerned, server is a requester, and client is
>> a responder, right?
>>
>> The error I encountered happens at the initial phase, when client
>> sends 4 back to back messages to server(using ibv_post_send ),
>> containing (rkey, addr, size) client's local memory.
>>
>> In these 4 ibv_post_send(), client will see one failure.   At server
>> side, server has already posted enough WQs in the RQ.  The failures
>> are included in my first email.
>>
>> Looking at the program output, it appears that, server gets messages
>> 1, issues RR 1, gets message 2, issues RR 2.    But somehow client
>> reports that "send message 2" fails.
>>
>> On the contrary, server reports "receive message 3" fails.
>>
>> As a result, server gets message 1,2,4, and succeeds with RR 1,2,4.
>> But clients sees that message 2 fails, and succeed with message 1,3,4.
>>  This inconsistency is the problem that puzzled me.
>>
>> 
>> By the way, how to interpret the parameters for RDMA, and what are
>> parameters that control RDMA behavior?  Below are something I can
>> find, there must be more
>>
>>   max_qp_rd_atom:                 4
>>   max_res_rd_atom:                258048
>>   max_qp_init_rd_atom:            128
>>
>>   qp_attr.max_dest_rd_atomic
>>   qp_attr.max_rd_atomic
>>
>>
>>
>> -neutron
>>
>>
>>
>> On Tue, Nov 10, 2009 at 2:04 AM, Paul Grun 
>> wrote:
>>
>>>
>>> Is it possible that you exceeded the number of available RDMA Read
>>> Resources
>>> available on the server?  There is an expectation that the client knows
>>> how
>>> many outstanding RDMA Read Requests the responder (server) is capable of
>>> handling; if the requester (client) exceeds that number, the responder
>>> will
>>> indeed return a NAK-Invalid Request.  Sounds like your server is
>>> configured
>>> to accept three outstanding RDMA Read Requests.
>>> This also explains why it works when you pause the program
>>> periodically...it
>>> gives the responder time to generate the RDMA Read Responses and
>>> therefore
>>> free up some resources to be used in receiving the next incoming RDMA
>>> Read
>>> Request.
>>>
>>> -Paul
>>>
>>> -Original Message-
>>> From: linux-rdma-ow...@vger.kernel.org
>>> [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of neutron
>>> Sent: Monday, November 09, 2009 9:04 PM
>>> To: linux-rdma@vger.kernel.org
>>> Subject: back to back RDMA read fail?
>>>
>>> Hi all,
>>>
>>> I have a simple program that test back to back RDMA read performance.
>>> However I encountered errors for unknown reasons.
>>>
>>> The basic flow of my program is:
>>>
>>> client:
>>> ibv_post_send() to send 4 back to back messages to server (no delay
>>> inbetween). Each message contains the (rkey, addr, size) of a local
>>> buffer. The buffer is registered with remote-read/write/ permissions.
>>> After that, ibv_poll_cq() is called to wait for completion.
>>>
>>> server:
>>> First, enough receive WRs are posted to the RQ.  Upon receipt of each
>>> message, immediately post a RDMA read request, using the (rkey, addr,
>>> size) information contained in the originating message.
>>>
>>> --
>>> Both client and server use RC QP.  Some errors are observed.
>>>
>>> On client side,  ibv_poll_cq() gets 4 CQE, one out of the 4 CQE is an
>>> error:
>>> CQ::  wr_id=0x0, wc_opcode=IBV_WC_SEND, wc_status=remote invalid RD
>>> request, wc_flag=0x3b
>>>     byte_len=11338758, immdata=1110104528, qp_num=0x0, src_qp=2290530758
>>>
>>> The other 3 CQE are success.
>>>
>>> On server side,
>>> 3 of the 4 messages are successfully received. One message produces an
>>> error CQE:
>>> CQ::  wr_id=0x80, wc_opcode=Unknow-wc-opcode,
>>> wc_status=unknown, wc_flag=0x0
>>>     byte_len=9569287, immdata=0, qp_num=0x0, src_qp=265551872
>>>
>>> The 3 RDMA read corresponding to the successful receive all succeed.
>>>
>>> But, if I pause the client program for a short while( usleep(100) for
>>> example ) after calling ibv_post_send(), then no error occurs.
>>> Anyone can point out the pitfall here? Thanks!
>>>
>>>
>>> ---
>>> On both client and server, I'm using  'mthca0' type MT25208.  The QPs
>>> are initialized with "qp_attr.max_dest_rd_atomic=4,
>>> qp_attr.max_rd_atomic = 4".  The QP's "devinfo -v" gives the
>>> information:
>>>
>>> hca_id: mthca0
>>>       fw_ver:

[PATCH] librdmacm/mckey: add notifications on events

2009-11-11 Thread Or Gerlitz

add notifications on multicast error and address change events which
can take place while traffic is running.

Signed-off-by: Or Gerlitz 

Index: librdmacm/examples/mckey.c
===
--- librdmacm.orig/examples/mckey.c
+++ librdmacm/examples/mckey.c
@@ -62,6 +62,7 @@ struct cmatest_node {
 
 struct cmatest {
 	struct rdma_event_channel *channel;
+	pthread_t	cmathread;
 	struct cmatest_node *nodes;
 	int conn_index;
 	int connects_left;
@@ -319,6 +320,30 @@ static int cma_handler(struct rdma_cm_id
 	return ret;
 }
 
+static void *cma_thread(void *arg)
+{
+	struct rdma_cm_event *event;
+	int ret;
+
+	while (1) {
+		ret = rdma_get_cm_event(test.channel, &event);
+		if (ret) {
+			perror("rdma_get_cm_event");
+			exit(ret);
+		}
+		switch (event->event) {
+		case RDMA_CM_EVENT_MULTICAST_ERROR:
+		case RDMA_CM_EVENT_ADDR_CHANGE:
+			printf("mckey: event: %s, status: %d\n",
+			       rdma_event_str(event->event), event->status);
+			break;
+		default:
+			break;
+		}
+		rdma_ack_cm_event(event);
+	}
+}
+
 static void destroy_node(struct cmatest_node *node)
 {
 	if (!node->cma_id)
@@ -475,6 +500,7 @@ static int run(void)
 	if (ret)
 		goto out;
 
+	pthread_create(&test.cmathread, NULL, cma_thread, NULL);
 	/*
 	 * Pause to give SM chance to configure switches.  We don't want to
 	 * handle reliability issue in this simple test program.

Re: [PATCH] OpenSM: Fix unused variable compiler warning.

2009-11-11 Thread Sasha Khapyorsky

On 10:41 Tue 10 Nov , Ira Weiny wrote:
> 
> From: Ira Weiny 
> Date: Tue, 10 Nov 2009 10:39:47 -0800
> Subject: [PATCH] OpenSM: Fix unused variable compiler warning.
> 
> 
> Signed-off-by: Ira Weiny 

Applied. Thanks.

Sasha

[PATCH] opensm/partition: keep multicast group pointer

2009-11-11 Thread Sasha Khapyorsky


Instead of MLID value (which may refer to more than one MGIDs) keep
pointer to related multicast group object in partition structure.

Signed-off-by: Sasha Khapyorsky 
---
 opensm/include/opensm/osm_partition.h |   11 ++-
 opensm/opensm/osm_prtn.c  |6 +++---
 opensm/opensm/osm_qos_policy.c|   21 +
 3 files changed, 14 insertions(+), 24 deletions(-)

diff --git a/opensm/include/opensm/osm_partition.h b/opensm/include/opensm/osm_partition.h
index 3c8a5aa..fdb34b9 100644
--- a/opensm/include/opensm/osm_partition.h
+++ b/opensm/include/opensm/osm_partition.h
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef __cplusplus
 #  define BEGIN_C_DECLS extern "C" {
@@ -92,8 +93,8 @@ BEGIN_C_DECLS
 typedef struct osm_prtn {
cl_map_item_t map_item;
ib_net16_t pkey;
-   ib_net16_t mlid;
uint8_t sl;
+   osm_mgrp_t *mgrp;
cl_map_t full_guid_tbl;
cl_map_t part_guid_tbl;
char name[32];
@@ -106,13 +107,13 @@ typedef struct osm_prtn {
 *  pkey
 *  The IBA defined P_KEY of this Partition.
 *
-*  mlid
-*  The network ordered LID of the well known Multicast Group
-*  that was created for this partition.
-*
 *  sl
 *  The Service Level (SL) associated with this Partiton.
 *
+*  mgrp
+*  The pointer to the well known Multicast Group
+*  that was created for this partition (when configured).
+*
 *  full_guid_tbl
 *  Container of pointers to all Port objects in the Partition
 *  with full membership, indexed by port GUID.
diff --git a/opensm/opensm/osm_prtn.c b/opensm/opensm/osm_prtn.c
index 4f84a80..f1094e3 100644
--- a/opensm/opensm/osm_prtn.c
+++ b/opensm/opensm/osm_prtn.c
@@ -225,7 +225,7 @@ ib_api_status_t osm_prtn_add_mcgroup(osm_log_t * p_log, osm_subn_t * p_subn,
cl_ntoh16(pkey));
if (p_mgrp) {
p_mgrp->well_known = TRUE;
-   p->mlid = p_mgrp->mlid;
+   p->mgrp = p_mgrp;
}
 
/* workaround for TS */
@@ -240,8 +240,8 @@ ib_api_status_t osm_prtn_add_mcgroup(osm_log_t * p_log, osm_subn_t * p_subn,
  &p_mgrp);
if (p_mgrp) {
p_mgrp->well_known = TRUE;
-   if (!p->mlid)
-   p->mlid = p_mgrp->mlid;
+   if (!p->mgrp)
+   p->mgrp = p_mgrp;
}
 
return status;
diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index fcb9935..ed631c9 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -772,7 +772,6 @@ static void __qos_policy_validate_pkey(
uint8_t sl;
uint32_t flow;
uint8_t hop;
-   osm_mgrp_t * p_mgrp;
 
if (!p_qos_policy || !p_qos_match_rule || !p_prtn)
return;
@@ -792,31 +791,21 @@ static void __qos_policy_validate_pkey(
 
/* If this partition is an IPoIB partition, there should
   be a matching MCast group. Fix this group's SL too */
-
-   if (!p_prtn->mlid)
-   return;
-
-   p_mgrp = osm_get_mgrp_by_mlid(p_qos_policy->p_subn, p_prtn->mlid);
-   if (!p_mgrp) {
-   OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_ERROR,
-   "ERR AC16: MCast group for partition with "
-   "pkey 0x%04X not found\n",
-   cl_ntoh16(p_prtn->pkey));
+   if (!p_prtn->mgrp)
return;
-   }
 
-   CL_ASSERT((cl_ntoh16(p_mgrp->mcmember_rec.pkey) & 0x7fff) ==
+   CL_ASSERT((cl_ntoh16(p_prtn->mgrp->mcmember_rec.pkey) & 0x7fff) ==
  (cl_ntoh16(p_prtn->pkey) & 0x7fff));
 
-   ib_member_get_sl_flow_hop(p_mgrp->mcmember_rec.sl_flow_hop,
+   ib_member_get_sl_flow_hop(p_prtn->mgrp->mcmember_rec.sl_flow_hop,
  &sl, &flow, &hop);
if (sl != p_prtn->sl) {
OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
"Updating MCGroup (MLID 0x%04x) SL to "
"match partition SL (%u)\n",
-   cl_hton16(p_mgrp->mcmember_rec.mlid),
+   cl_hton16(p_prtn->mgrp->mcmember_rec.mlid),
p_prtn->sl);
-   p_mgrp->mcmember_rec.sl_flow_hop =
+   p_prtn->mgrp->mcmember_rec.sl_flow_hop =
ib_member_set_sl_flow_hop(p_prtn->sl, flow, hop);
}
 }
-- 
1.6.5.2


Re: SRPT and SCST

2009-11-11 Thread Vladislav Bolkhovitin


Arend Dittmer, on 11/11/2009 03:33 AM wrote:
To Bart's earlier question  ... we apologize for not being able to come up with a time and date when the initiators lost contact with the target. We have not been able to test with an initiator from a vanilla kernel. We only tested with the initiator that ships with RedHat 5.3. We built the module against a slightly modified RedHat kernel that includes process management patches that allow for a unified process space for our cluster management software Scyld clusterware. Our patches do not affect any storage components. 


[r...@head0 ~]# bpsh 1 modinfo ib_srp
filename:       /lib/modules/2.6.18-128.1.1.el5.530g/kernel/drivers/infiniband/ulp/srp/ib_srp.ko
license:        Dual BSD/GPL
description:    InfiniBand SCSI RDMA Protocol initiator v0.2 (November 1, 2005)
author:         Roland Dreier
srcversion:     23B2629641E1A475BF72F44
depends:        ib_core,scsi_mod,ib_cm,ib_sa
vermagic:       2.6.18-128.1.1.el5.530g SMP mod_unload gcc-4.1
parm:           srp_sg_tablesize:Max number of gather/scatter entries per I/O (default is 12) (int)
parm:           topspin_workarounds:Enable workarounds for Topspin/Cisco SRP target bugs if != 0 (int)
parm:           mellanox_workarounds:Enable workarounds for Mellanox SRP target bugs if != 0 (int)


Hmm, "workarounds for Mellanox SRP target", i.e. for SCST SRP target?
What are those workarounds for, and why not fix the corresponding
problems in the target? It isn't obvious from the source code.



parm:           srp_dev_loss_tmo:Default number of seconds that srp transport should insulate the lost of a remote port (default is 60 secs (int)
module_sig:     883f35049c0555e56ccec1c0ba19c3112f87b09e2872185017b618b6026be92291a62b5446018e009d1e3299cd274ad8e31c3d0b03081b112959d4d84

Also ... we ran with only a single thread.

Thanks

Arend 



-Original Message-
From: Chris Worley [mailto:worl...@gmail.com]
Sent: Mon 11/9/2009 3:43 PM
To: Vladislav Bolkhovitin
Cc: Bart Van Assche; Arend Dittmer; Philip Pokorny; 
scst-de...@lists.sourceforge.net; linux-rdma@vger.kernel.org; Vu Pham
Subject: Re: SRPT and SCST
 
On Mon, Nov 9, 2009 at 1:26 PM, Vladislav Bolkhovitin  wrote:

Bart Van Assche, on 11/08/2009 12:49 PM wrote:

On Fri, Nov 6, 2009 at 6:28 PM, Arend Dittmer
 wrote:

Please find attached the gzip'ed /var/log/messages.

This log clearly show the login and logout actions from the different
initiators. I couldn't find anything unusual in the posted log file
however. Around which time did the initiator start complaining about
aborted SCSI commands ? Does this issue also happen when using the SRP
initiator included in a vanilla (non-OFED) Linux kernel ?

It looks painfully similar to what Chris Worley experienced some time ago
and somehow fixed/workarounded.

Chris, can you comment on this?


The "thread=1" fixed the problem mostly, but I am working with another
group that says they still get an abort, but haven't gotten around to
providing me with the info I need to look at it.

Chris

Bart.

Re: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM

2009-11-11 Thread Or Gerlitz


Sean Hefty wrote:

I'll compare my final patches against the ones submitted by David to see if 
anything got missed
  
Are Jason's patches a superset of David's patches? or they need to be 
applied and only then David's work can be re-reviewed/merged, etc?


Or.

Re: back to back RDMA read fail?

2009-11-11 Thread Dotan Barak


Hi.

how do you connect the QPs?
via CM/CMA or by sockets (and you actually call the ibv_modify_qp)?

Dotan

neutron wrote:

Hi Paul, thanks a lot for your quick reply!

In my test,  client informs the server of its local memory (rkey,
addr, size) by sending 4 back to back messages,  each message elicits
a RDMA read request (RR) from the server.

In other words, client exposes its memory to the server, and server
RDMA reads it.

As far as RDMA read is concerned, server is a requester, and client is
a responder, right?

The error I encountered happens at the initial phase, when client
sends 4 back to back messages to server(using ibv_post_send ),
containing (rkey, addr, size) client's local memory.

In these 4 ibv_post_send(), client will see one failure.   At server
side, server has already posted enough WQs in the RQ.  The failures
are included in my first email.

Looking at the program output, it appears that, server gets messages
1, issues RR 1, gets message 2, issues RR 2.  But somehow client
reports that "send message 2" fails.

On the contrary, server reports "receive message 3" fails.

As a result, server gets message 1,2,4, and succeeds with RR 1,2,4.
But clients sees that message 2 fails, and succeed with message 1,3,4.
  This inconsistency is the problem that puzzled me.


By the way, how to interpret the parameters for RDMA, and what are
parameters that control RDMA behavior?  Below are something I can
find, there must be more

   max_qp_rd_atom: 4
   max_res_rd_atom:258048
   max_qp_init_rd_atom:128

   qp_attr.max_dest_rd_atomic
   qp_attr.max_rd_atomic



-neutron



On Tue, Nov 10, 2009 at 2:04 AM, Paul Grun  wrote:
  

Is it possible that you exceeded the number of available RDMA Read Resources
available on the server?  There is an expectation that the client knows how
many outstanding RDMA Read Requests the responder (server) is capable of
handling; if the requester (client) exceeds that number, the responder will
indeed return a NAK-Invalid Request.  Sounds like your server is configured
to accept three outstanding RDMA Read Requests.
This also explains why it works when you pause the program periodically...it
gives the responder time to generate the RDMA Read Responses and therefore
free up some resources to be used in receiving the next incoming RDMA Read
Request.
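
The accounting described above can be sketched as a small counter kept by
whichever side issues the RDMA reads.  This is an illustration under
assumptions, not libibverbs API: struct rd_throttle and its helpers are
hypothetical, "limit" is assumed to be the number of outstanding reads the
responder accepts (the value configured as max_dest_rd_atomic there), and
the real ibv_post_send() / ibv_poll_cq() calls are left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not a libibverbs type. */
struct rd_throttle {
	unsigned int limit;	/* reads the responder accepts at once */
	unsigned int inflight;	/* reads posted but not yet completed */
};

void rd_throttle_init(struct rd_throttle *t, unsigned int limit)
{
	assert(limit > 0);
	t->limit = limit;
	t->inflight = 0;
}

/* Call before posting an RDMA read; false means "wait for a completion". */
bool rd_throttle_try_post(struct rd_throttle *t)
{
	if (t->inflight >= t->limit)
		return false;
	t->inflight++;
	return true;
}

/* Call once for every RDMA read completion taken from the CQ. */
void rd_throttle_complete(struct rd_throttle *t)
{
	assert(t->inflight > 0);
	t->inflight--;
}
```

With a limit of three, a fourth read is held back until one of the first
three completes, which is roughly what the usleep() achieves by accident.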

-Paul

-Original Message-
From: linux-rdma-ow...@vger.kernel.org
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of neutron
Sent: Monday, November 09, 2009 9:04 PM
To: linux-rdma@vger.kernel.org
Subject: back to back RDMA read fail?

Hi all,

I have a simple program that test back to back RDMA read performance.
However I encountered errors for unknown reasons.

The basic flow of my program is:

client:
ibv_post_send() to send 4 back to back messages to server (no delay
inbetween). Each message contains the (rkey, addr, size) of a local
buffer. The buffer is registered with remote-read/write/ permissions.
After that, ibv_poll_cq() is called to wait for completion.

server:
First, enough receive WRs are posted to the RQ.  Upon receipt of each
message, immediately post a RDMA read request, using the (rkey, addr,
size) information contained in the originating message.

--
Both client and server use RC QP.  Some errors are observed.

On client side,  ibv_poll_cq() gets 4 CQE, one out of the 4 CQE is an error:
CQ::  wr_id=0x0, wc_opcode=IBV_WC_SEND, wc_status=remote invalid RD
request, wc_flag=0x3b
 byte_len=11338758, immdata=1110104528, qp_num=0x0, src_qp=2290530758

The other 3 CQE are success.

On server side,
3 of the 4 messages are successfully received. One message produces an
error CQE:
CQ::  wr_id=0x80, wc_opcode=Unknow-wc-opcode,
wc_status=unknown, wc_flag=0x0
 byte_len=9569287, immdata=0, qp_num=0x0, src_qp=265551872

The 3 RDMA read corresponding to the successful receive all succeed.

But, if I pause the client program for a short while( usleep(100) for
example ) after calling ibv_post_send(), then no error occurs.
Anyone can point out the pitfall here? Thanks!


---
On both client and server, I'm using  'mthca0' type MT25208.  The QPs
are initialized with "qp_attr.max_dest_rd_atomic=4,
qp_attr.max_rd_atomic = 4".  The QP's "devinfo -v" gives the
information:

hca_id: mthca0
   fw_ver: 5.1.400
   node_guid:  0002:c902:0023:c04c
   sys_image_guid: 0002:c902:0023:c04f
   vendor_id:  0x02c9
   vendor_part_id: 25218
   hw_ver: 0xA0
   board_id:   MT_0370130002
   phys_port_cnt:  2
   max_mr_size:0x
   page_size_cap:  0xf000
   max_qp: 64512
   max_qp_wr:  16384
   device_cap_flags:   0x1c76

Re: [PATCH] opensm/osm_state_mgr.c: force heavy sweep when fabric consists of single switch

2009-11-11 Thread Yevgeny Kliteynik


Eli Dorfman (Voltaire) wrote:

Yevgeny Kliteynik wrote:

Eli Dorfman (Voltaire) wrote:

Yevgeny Kliteynik wrote:

Eli Dorfman (Voltaire) wrote:

Yevgeny Kliteynik wrote:

Yevgeny Kliteynik wrote:

Line Holen wrote:

On 11/ 4/09 04:54 PM, Yevgeny Kliteynik wrote:

Line Holen wrote:

On 11/ 4/09 10:47 AM, Yevgeny Kliteynik wrote:

Sasha Khapyorsky wrote:

On 12:26 Tue 03 Nov , Yevgeny Kliteynik wrote:

Always do heavy sweep when there is only one node in the
fabric, and this node is a switch, and SM runs on top of it -
there may be a race when OSM starts running before the
external ports are up, or if they went through
reset while SM was starting.
In this race switch brings up the ports and turns on the
PSC bit, but OSM might get PortInfo before SwitchInfo, and it
might see all ports as down, but PSC bit on. If that happens,
OSM turns off PSC bit, and it will never see external ports
again - it won't perform any heavy sweep, only light sweep
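
The race and the effect of forcing a heavy sweep can be written out as a
toy model.  This is a sketch under assumptions and not OpenSM code: the
structs, racy_heavy_sweep() and heavy_sweep_needed() are hypothetical
names, and the decision function only mirrors the condition stated in the
description (single node, SM running on that switch).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; none of these names exist in OpenSM. */
struct sw_state {
	bool ports_up;	/* external ports are up */
	bool psc;	/* SwitchInfo:PortStateChange */
};

struct sm_state {
	unsigned int nodes_found;	/* nodes seen by the last heavy sweep */
	bool runs_on_switch;		/* SM runs on the managed switch */
};

/* Heavy sweep with the unlucky ordering described above. */
void racy_heavy_sweep(struct sw_state *sw, struct sm_state *sm)
{
	assert(!sw->ports_up);	/* PortInfo is read while ports are down */
	sm->nodes_found = 1;	/* so only the switch itself is discovered */

	sw->ports_up = true;	/* switch brings the ports up ... */
	sw->psc = true;		/* ... and turns on PSC */

	/* SwitchInfo is read now: SM sees PSC on and turns it off */
	sw->psc = false;
}

/* Decision taken on a light sweep: is a heavy sweep needed? */
bool heavy_sweep_needed(const struct sw_state *sw, const struct sm_state *sm,
			bool force_on_single_switch)
{
	if (sw->psc)
		return true;
	return force_on_single_switch && sm->runs_on_switch &&
	       sm->nodes_found == 1;
}
```

After racy_heavy_sweep() the ports are up but PSC is off, so without the
forced case every later light sweep answers "no" and the ports are never
discovered.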

Could such a race happen when there is more than one node in a
fabric?

I think that my description of the race was misleading.
The race can happen on *any* fabric when SM runs on switch.
But when it does happen, SM thinks that the whole subnet
is just one switch - that's what it managed to discover.
I've actually seen it happening.
So the patch fixes this particular case.

So the next question that you would probably ask is can
this race happen on some *other* switch and not the one
SM is running on?

Well, I don't know. I have a hunch that it can't, but I
couldn't prove it to myself yet.

The race on the managed switch is a special case because
SM always sees port 0, and always gets responses to its
SMP queries. On any other switch, if the ports were reset,
SM won't get any response until the ports are up again.

Perhaps there might be a case where SM got some port as down,
and by the time SM got SwitchInfo with PSC bit the port
was already up, so SM won't start discovery beyond this
port. But this race would be fixed on the next heavy sweep,
when SM will discover this port that it missed the previous
time, whereas race on managed switch is fatal - SM won't
ever do any heavy sweep.

-- Yevgeny

At least for the 3.2 branch there is a general race regardless of
where the SM is running. I haven't checked the current master, but
I cannot recall seeing any patches related to this so I assume
the race is still there.

There is a window between SM discovering a switch and clearing PSC
for the same switch. The SM will not detect a state change on the
switch ports during this time.

If the port changes state during that period, the switch issues
new trap 128, which (I think) should cause SM to re-discover the
fabric once this discovery cycle is over. Is this correct?


I think the switch shall send a trap whenever it sets the PSC bit.
Once set I believe it will not send another trap until it is reset.
Or do I misinterpret the spec ?

I may be wrong, but I thought that this is how things work:
- port state changes
- switch turns on PSC bit and starts sending traps
- SM gets the trap, sends trap repress
- switch gets trap repress and stops sending traps
- PSC is still on
- port state changes again (the same or any other port)
- switch turns on PSC bit (which doesn't matter as PSC is
  already on) and starts sending traps again
- etc...
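
The same sequence as a toy model of the switch side.  Sketch only: struct
sw_model and its functions are made up for illustration and are not SMA or
OpenSM code.  The only behaviour taken from the list above is that a port
state change sets PSC and starts traps whether or not PSC was already on,
and that a trap repress stops the traps without touching PSC.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the switch; names are hypothetical. */
struct sw_model {
	bool psc;		/* SwitchInfo:PortStateChange */
	bool sending_traps;	/* trap 128 is being sent */
};

void port_state_changed(struct sw_model *sw)
{
	sw->psc = true;			/* no effect if already on */
	sw->sending_traps = true;	/* traps start regardless of PSC */
}

void trap_repress_received(struct sw_model *sw)
{
	sw->sending_traps = false;	/* PSC stays on */
}

/* Replays the steps from the list above. */
void replay(struct sw_model *sw)
{
	port_state_changed(sw);
	assert(sw->psc && sw->sending_traps);

	trap_repress_received(sw);
	assert(sw->psc && !sw->sending_traps);

	port_state_changed(sw);		/* PSC is still on from before */
	assert(sw->psc && sw->sending_traps);
}
```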

Anyway, I'll double-check this issue.

Yep, verified.
Switch sends traps regardless of the PSC bit status.
Also, the spec doesn't link them together:

  o14-5.1.1: If a switch supports Traps (PortInfo:
  CapabilityMask.IsTrapSupported is one), its SMA
  shall send trap 128 to the SM indicated by the
  PortInfo:MasterSMLID under any condition that
  would cause SwitchInfo:PortStateChange to be set
  to one. (See 14.2.5.4 SwitchInfo on page 827.)


Trap will be sent according to the SMLID. After first bring up the
SMLID is not set yet and trap will not be sent.
In that case the opensm would discover the change only by PSC bit.
For IS3 chips the PSC bit and/or trap were set only after one or more
ports changed their state, so I don't understand how can the SM
discover PSC bit set while all ports are down. Or is this a change in
IS4?

It can happen when SM runs on the switch, not on a host.
In this case if all ports are going down, SM will see
them all down and it will see PSC bit on.

So this patch is only for SM running on a switch which is the only
node in the fabric?
I don't see the race when there is more than one switch - please explain.

Quoting from above:

  The race can happen on *any* fabric when SM runs on switch.
  But when it does happen, SM thinks that the whole subnet
  is just one switch - that's what it managed to discover.


I saw that but I don't understand how this can happen.
If PSC bit is set after *every* port state change and
SM clears PSC bit before reading PortInfo from the switch,


osm_node_info_rcv.c, ni_rcv_process_switch():
I see in the code that SM receives NodeInfo, then it requests
SwitchInfo and right a
