from:"\"Or Gerlitz\""


Hal Rosenstock wrote:

On Thu, Oct 29, 2009 at 10:23 PM, Sasha Khapyorsky  wrote:

Implementation description would be very useful. What does "initial support" 
mean?

It means there's more to come in terms of using 
OptimizedSLtoVLMappingProgramming. This is the simplest use/introduction of 
this optional feature.
You can't just send people to read the specs; your change log should explain 
what the patch is about. If this is a big change to opensm, maybe even 
RFC it with a detailed writeup.


Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM


Jason Gunthorpe wrote:

On Wed, Oct 28, 2009 at 10:05:19AM -0700, Sean Hefty wrote:

A UD endpoint can communicate using multicast and to other UD endpoints.  A 
user could resolve a UD endpoint before joining a multicast group.


So the IP world analog would be:
fd = socket(AF_INET,SOCK_DGRAM);
connect(fd,'Some Unicast Address');
setsockopt(fd,IP_ADD_MEMBERSHIP,'Some Multicast Address');
sendto(fd,...,'Some Multicast Address');

IP multicast senders don't call IP_ADD_MEMBERSHIP, only receivers

Or.


[PATCH RESEND] ib/iser: re-write SG handling for rdma logic

After dma-mapping an SG list provided by the SCSI midlayer, iser has
to make sure the mapped SG is "aligned for RDMA" in the sense that its
possible to produce one mapping in the HCA IOMMU which represents the
whole SG. Next, the mapped SG is formatted for registration with the HCA.

This patch re-writes the logic that does the above, to make it clearer
and simpler. It also fixes a bug in the being aligned for rdma checks,
where a "start" check wasn't done but rather only "end" check.
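
For readers unfamiliar with the "aligned for RDMA" notion, the following stand-alone sketch shows the idea of checking both the "start" and the "end" of each element. It is a simplified illustration and not the code of this patch: struct dma_ent and sg_is_rdma_aligned are hypothetical names, and the 4K constants are redefined locally so the snippet compiles outside the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define SIZE_4K 4096ULL
#define MASK_4K (~(SIZE_4K - 1))
#define IS_4K_ALIGNED(addr) ((((uint64_t)(addr)) & ~MASK_4K) == 0)

/* Hypothetical stand-in for one DMA-mapped SG entry. */
struct dma_ent {
    uint64_t addr;  /* DMA address of the entry */
    uint32_t len;   /* length in bytes */
};

/* Returns 1 if the whole list can be described by a single page vector. */
static int sg_is_rdma_aligned(const struct dma_ent *sg, size_t n)
{
    size_t i;

    for (i = 0; i + 1 < n; i++) {
        uint64_t end = sg[i].addr + sg[i].len;
        uint64_t next_start = sg[i + 1].addr;

        if (end == next_start)
            continue;           /* fragments of one contiguous chunk */
        if (!IS_4K_ALIGNED(end))
            return 0;           /* "end" check */
        if (!IS_4K_ALIGNED(next_start))
            return 0;           /* "start" check */
    }
    return 1;
}
```

Only the first element may start unaligned and only the last one may end unaligned; any gap in between must fall on 4K boundaries on both sides.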

Signed-off-by: Alexander Nezhinsky 
Signed-off-by: Or Gerlitz 

Index: linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
===
--- linux-2.6.32-rc5.orig/drivers/infiniband/ulp/iser/iser_memory.c
+++ linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
@@ -209,6 +209,8 @@ void iser_finalize_rdma_unaligned_sg(str
 	mem_copy->copy_buf = NULL;
 }

+#define IS_4K_ALIGNED(addr)	((((unsigned long)addr) & ~MASK_4K) == 0)
+
 /**
  * iser_sg_to_page_vec - Translates scatterlist entries to physical addresses
  * and returns the length of resulting physical address array (may be less than
@@ -221,62 +223,52 @@ void iser_finalize_rdma_unaligned_sg(str
  * where --few fragments of the same page-- are present in the SG as
  * consecutive elements. Also, it handles one entry SG.
  */
+
 static int iser_sg_to_page_vec(struct iser_data_buf *data,
 			       struct iser_page_vec *page_vec,
 			       struct ib_device *ibdev)
 {
-	struct scatterlist *sgl = (struct scatterlist *)data->buf;
-	struct scatterlist *sg;
-	u64 first_addr, last_addr, page;
-	int end_aligned;
-	unsigned int cur_page = 0;
+	struct scatterlist *sg, *sgl = (struct scatterlist *)data->buf;
+	u64 start_addr, end_addr, page, chunk_start = 0;
 	unsigned long total_sz = 0;
-	int i;
+	unsigned int dma_len;
+	int i, new_chunk, cur_page, last_ent = data->dma_nents - 1;

 	/* compute the offset of first element */
 	page_vec->offset = (u64) sgl[0].offset & ~MASK_4K;

+	new_chunk = 1;
+	cur_page  = 0;
 	for_each_sg(sgl, sg, data->dma_nents, i) {
-		unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
-
+		start_addr = ib_sg_dma_address(ibdev, sg);
+		if (new_chunk)
+			chunk_start = start_addr;
+		dma_len = ib_sg_dma_len(ibdev, sg);
+		end_addr = start_addr + dma_len;
 		total_sz += dma_len;

-		first_addr = ib_sg_dma_address(ibdev, sg);
-		last_addr  = first_addr + dma_len;
-
-		end_aligned   = !(last_addr  & ~MASK_4K);
-
-		/* continue to collect page fragments till aligned or SG ends */
-		while (!end_aligned && (i + 1 < data->dma_nents)) {
-			sg = sg_next(sg);
-			i++;
-			dma_len = ib_sg_dma_len(ibdev, sg);
-			total_sz += dma_len;
-			last_addr = ib_sg_dma_address(ibdev, sg) + dma_len;
-			end_aligned = !(last_addr  & ~MASK_4K);
+		/* collect page fragments until aligned or end of SG list */
+		if (!IS_4K_ALIGNED(end_addr) && i < last_ent) {
+			new_chunk = 0;
+			continue;
 		}
+		new_chunk = 1;

-		/* handle the 1st page in the 1st DMA element */
-		if (cur_page == 0) {
-			page = first_addr & MASK_4K;
-			page_vec->pages[cur_page] = page;
-			cur_page++;
+		/* address of the first page in the contiguous chunk;
+		   masking relevant for the very first SG entry,
+		   which might be unaligned */
+		page = chunk_start & MASK_4K;
+		do {
+			page_vec->pages[cur_page++] = page;
 			page += SIZE_4K;
-		} else
-			page = first_addr;
-
-		for (; page < last_addr; page += SIZE_4K) {
-			page_vec->pages[cur_page] = page;
-			cur_page++;
-		}
-
+		} while (page < end_addr);
 	}
+
 	page_vec->data_size = total_sz;
 	iser_dbg("page_vec->data_size:%d cur_page %d\n", page_vec->data_size,cur_page);
 	return cur_page;
 }

-#define IS_4K_ALIGNED(addr)	((((unsigned long)addr) & ~MASK_4K) == 0)

 /**
  * iser_data_buf_aligned_len - Tries to determine the maximal correctly aligned
@@ -284,42 +276,40 @@ static int iser_sg_to_page_vec(struct is
  * the number of entries which are aligned correctly. Supports the case where
  * consecutive

Re: [PATCH v3] [RFC] rdma/cm: support option to allow manually setting IB path


Sean Hefty wrote:

Jason and Or, does this seem ready to queue for 2.6.33?
Roland, I missed your email last week. Anyway, as I wrote to Sean 
earlier, I'm totally fine with this patch allowing user space to set 
a path record for the kernel.


Or.

Re: [PATCH v4] rdma/cm: support option to allow manually setting IB path

2009-11-01 Thread Or Gerlitz


Sean Hefty wrote:

Future changes to the rdma cm can expand on this framework to support the full 
range of features allowed by the IB CM, such as separate forward and reverse 
paths and APM

Sean,

Before enhancing the rdma-cm to support the full feature set of the IB 
CM, something which I personally don't see the actual need for (but I 
will be happy to be educated on which applications will or can migrate to 
rdma-cm once this is implemented), how about trying to allow for a reduced 
QoS scheme also when the entity that resolved this path didn't 
consult the SA?


IB QoS is based on the query providing the  
tuple and the SA returning a  QoS tuple. Now 
I'd like to see how we can let the application / querying middleware 
take advantage of the knowledge of which partition it runs on and use the SL 
associated with the IPv4 (e.g. AF_INET rdma-cm IDs) IPoIB broadcast 
group. This way, one can still program a QoS scheme at the SA which is 
based on partitions.


Looking at mckey, the user space code (e.g. ACM) could just do rdma_bind 
to an IP address of an IPoIB NIC that uses this partition and then 
rdma_join to an unmapped multicast address which corresponds to the 
broadcast group, take the SL and leave the group. Makes sense?


Or.


[PATCH] librdmacm/mckey: enforce local binding for unmapped multicast addresses

2009-11-01 Thread Or Gerlitz

Enforce that a local binding is specified for unmapped multicast addresses;
otherwise mckey crashes when attempting to use the cma_id->verbs pointer in
the port query verb.

Signed-off-by: Or Gerlitz 

Sean, using unmapped multicast addresses I see that a different broadcast group
is created by the SM, such that mckey doesn't manage to join the IPv4 broadcast
group.

$ ./mckey -M ff12:401b::0:0:0:: -b 10.10.5.62 -p 0x2

mckey: joined dgid: ff12:401b::: mlid c00b sl 0

Looking in the SA, I see that the MGID used by the rdma-cm is a bit different
from the one used by IPoIB, since the former uses/sets only the lower 28 bits
where the latter sets the lower 32 bits for this MGID. Any idea what can be done
here?
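
The 28 versus 32 bit difference can be reproduced with a few lines of C. This is only a sketch that assumes the RFC 4391 MGID layout (ff1<scope>:401b:<pkey>, then the group ID in the low-order bits); the function names are made up for the illustration and do not exist in the kernel or in librdmacm:

```c
#include <stdint.h>
#include <string.h>

/* Fill the common prefix ff1<scope>:401b:<pkey>:: of an IPv4 IPoIB MGID. */
static void mgid_prefix(uint8_t mgid[16], uint8_t scope, uint16_t pkey)
{
    memset(mgid, 0, 16);
    mgid[0] = 0xff;
    mgid[1] = 0x10 | (scope & 0x0f);
    mgid[2] = 0x40;                 /* IPv4 signature 0x401b */
    mgid[3] = 0x1b;
    mgid[4] = pkey >> 8;
    mgid[5] = pkey & 0xff;
}

/* Multicast mapping: only the low 28 bits of the IPv4 group address are used. */
static void mgid_from_ipv4_mcast(uint8_t mgid[16], uint8_t scope,
                                 uint16_t pkey, uint32_t ip_host_order)
{
    mgid_prefix(mgid, scope, pkey);
    mgid[12] = (ip_host_order >> 24) & 0x0f;   /* top nibble is dropped */
    mgid[13] = (ip_host_order >> 16) & 0xff;
    mgid[14] = (ip_host_order >> 8) & 0xff;
    mgid[15] = ip_host_order & 0xff;
}

/* IPoIB broadcast group: all of the low 32 bits are set. */
static void mgid_ipv4_broadcast(uint8_t mgid[16], uint8_t scope, uint16_t pkey)
{
    mgid_prefix(mgid, scope, pkey);
    memset(&mgid[12], 0xff, 4);
}
```

Feeding 255.255.255.255 through the 28-bit mapping yields a group ID ending in 0fff:ffff, while the broadcast group ends in ffff:ffff, so the two MGIDs never match.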

$ saquery $THIS_NODE_LID

MCMemberRecord group dump:
        MGID....ff12:401b::::
        Mlid....0xC000
        Mtu.....0x84
        pkey....0x
        Rate....0x83
        SL......0x0

MCMemberRecord group dump:
        MGID....ff12:401b:::fff:
        Mlid....0xC00B
        Mtu.....0x84
        pkey....0x
        Rate....0x83
        SL......0x0


Index: librdmacm/examples/mckey.c
===
--- librdmacm.orig/examples/mckey.c
+++ librdmacm/examples/mckey.c
@@ -273,7 +273,7 @@ static int join_handler(struct cmatest_n
 	char buf[40];

 	inet_ntop(AF_INET6, param->ah_attr.grh.dgid.raw, buf, 40);
-	printf("mckey: joined dgid: %s\n", buf);
+	printf("mckey: joined dgid: %s mlid %x sl %d\n", buf, param->ah_attr.dlid, param->ah_attr.sl);

 	node->remote_qpn = param->qp_num;
 	node->remote_qkey = param->qkey;
@@ -556,6 +556,11 @@ int main(int argc, char **argv)
 		}
 	}

+	if (unmapped_addr && !src_addr) {
+		printf("unmapped multicast address requires binding to source address\n");
+		exit(1);
+	}
+
 	test.dst_addr = (struct sockaddr *) &test.dst_in;
 	test.connects_left = connections;


Re: Crash in bonding

2009-11-02 Thread Or Gerlitz


Pradeep Satyanarayana wrote:
This crash was originally reported against Rhel5.4. However, one can recreate this crash quite easily in OFED-1.5 too. 
I understand that you get the crash when working with the RHEL 5.4 
bonding driver, correct? Does it happen only with IPoIB devices acting 
as the bonding slaves or also with Ethernet devices? Please note that 
with RHEL 5.4 there's no need to use the ofed-provided bonding module; 
moreover, I believe that the distro-provided one is more stable and 
up to date in this case. Moving forward, ofed bonding support for newish 
distributions is to be removed. Moni, any reason to support bonding/EL 
5.4 in ofed?


Or.


The steps to recreate the crash are as follows:
1. Run traffic (I used ping) on the IB interfaces through the bond master
2. ifdown ib0
3. ifdown ib1
4. modprobe -r ib_ipoib



Re: [PATCH] librdmacm/mckey: enforce local binding for unmapped multicast addresses

2009-11-03 Thread Or Gerlitz


Sean Hefty wrote:

Unmapped multicast groups only support the case where the SA has created the
group with the MGID undefined.  The MGID must be in this format: 0xff1 scope 
0xA01B (see figure 196 on page 928 of the spec).  The kernel checks for this 
specific address format to see if it needs to convert the address or not [...] 
wanted the ability to create a group and get back a unique group ID
I am still not sure I follow you. My basic thought was that unmapped 
multicast addresses are MGIDs specified by the application such that 
rdma-cm doesn't treat them as IPv6 multicast addresses and no mapping is 
applied to them. From the spec location you pointed me to, I understand 
that the intention is for a request to the SA to generate a unique MGID:


1. "if SA receives a request to create a multicast group with the MGID 
undefined"

2.  "the MGID that it creates shall be of the following format"

So there are two parts here: first, requesting the SA to create a new group and 
assign it an MGID (what about joining this node/port to the group?); second, 
getting back the MGID created by the SA. Looking at the rdma-cm kernel 
code, I don't see where/how it specifies to the SA that the MGID is 
undefined. Shouldn't it leave the MGID bit in the component mask unset in 
this case? Next, I don't see where the MGID created by the SA is given 
back to the application. I guess I am still missing something here, can you 
clarify? Thanks.


Or.


Re: [PATCH 19/25] mlx4: Randomizing mac addresses for slaves

2009-11-04 Thread Or Gerlitz

On Wed, Nov 4, 2009 at 10:04 PM, Roland Dreier  wrote:
>> +#define MLX4_MAC_HEAD               0x2c900ULL

> Is this a good idea?  You're basically choosing 24 random bits within your 
> OUI...
> seems the chance of collision with another MAC used on the same network is
> high enough that it could easily happen in practice on a moderately big 
> network.

Yes, this was brought up by Stephen and others on this list back on
September 11th this year @
http://marc.info/?l=linux-netdev&m=125263488409128

> Can you pick a reserved range or something?

Using a different OUI for the VF device wouldn't help either, I think,
since the number of VFs becomes fairly big even on a modest size cluster with
(say) a VM consuming a VF per 1-2 cores.
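
To put a rough number on the collision concern, here is a back-of-the-envelope sketch. It assumes each address is drawn uniformly and independently from 2^24 values (the 24 random bits mentioned above); the function name is made up for the illustration:

```c
/* Probability that at least two of n addresses collide when each one is
 * drawn uniformly at random from a space of `space` distinct values. */
static double collision_probability(unsigned long n, double space)
{
    double p_unique = 1.0;
    unsigned long i;

    if ((double)n > space)
        return 1.0;             /* more draws than values: certain collision */
    for (i = 1; i < n; i++)
        p_unique *= 1.0 - (double)i / space;   /* draw i+1 avoids the first i */
    return 1.0 - p_unique;
}
```

Under these assumptions, collision_probability(5000, 16777216.0) comes out a bit over 50%, so a few thousand VFs on one L2 network are enough to make a clash more likely than not.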

Or.

Re: [PATCH] librdmacm/mckey: enforce local binding for unmapped multicast addresses

2009-11-07 Thread Or Gerlitz


Sean Hefty wrote:

I merged this with your other patch to mckey and applied them to my tree
  
I don't see this @ 
http://www.openfabrics.org/git/?p=~shefty/librdmacm.git, were you 
referring to a local clone?


Or.

Re: QoS in local SA entity

2009-11-07 Thread Or Gerlitz


Sean Hefty wrote:

I wasn't trying to limit how the SA could 'distribute' QoS information to the 
end nodes.  ACM will obtain QoS information from the SA when it joins its
multicast groups
Excellent... still, this is dependent on how the ACM MGIDs are 
constructed; I'll take a look at the code.



ACM is intended to be a service that's used by the librdmacm to resolve address 
mappings and routes.  Trying to have ACM use the librdmacm ends up with a 
circular dependency.  That's the part I'm trying to avoid.


Fair enough. I believe that my suggestion is doable also without a 
circular dependency, e.g. as you indicated below or with a fairly small 
enhancement of librdmacm, see next.




ACM uses address mappings as defined in an address configuration file (IP ->
device, port, pkey).  The address file can be created using the provided 
ib_acme utility, which uses the current system configuration (in an ugly way, 
but it works).  I think this provides QoS behavior similar to what you're 
describing
I assume you are referring to an IP address local to the system where ACM runs, 
correct? This would work well for applications calling rdma_bind 
and/or rdma_resolve_addr while specifying a source address. To 
support also the case of applications which do neither of these two, that 
is, call rdma_resolve_addr with a dest address only, I suggest enhancing the 
librdmacm-calling-ACM flow to resolve the source address using a route 
lookup from user space. Next, librdmacm can issue rdma_bind on behalf 
of this ID, and you have the  triplet at hand, so 
now the ACM call can be made from librdmacm. Writing this, I realized 
that this had better (should) be done also for apps calling rdma_resolve_addr 
with the src IP specified. This way you have a unified flow for the ACM use in 
librdmacm for any of apps A, B, C below:


A.1 rdma_bind(src=X)
A.2 rdma_resolve_addr(src=null, dst=Y)

B.1 rdma_resolve_addr(src=null, dst=Y)

C.1 rdma_resolve_addr(src=X, dst=Y)

where librdmacm calling-ACM flow is

L1. compute source address
L2. issue kernel rdma_bind to source address and resolve pkey>
L3. issue ACM address (DGID) resolution call using (pkey>, dest-ip)
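
Step L1 could be done, for example, with the usual connect()/getsockname() trick on a datagram socket, which runs the kernel route lookup without sending a packet. A minimal sketch, assuming IPv4 only; the helper name route_lookup_src is made up for this illustration and is not part of librdmacm:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Resolve the local source address the kernel would use to reach dst_ip.
 * Returns 0 and fills *src on success, -1 on failure. */
static int route_lookup_src(const char *dst_ip, struct in_addr *src)
{
    struct sockaddr_in dst, local;
    socklen_t len = sizeof(local);
    int fd, ret = -1;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9);    /* any non-zero port; nothing is sent */
    if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    /* connect() on a datagram socket only performs the route lookup
     * and fixes the local address of the socket */
    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) == 0 &&
        getsockname(fd, (struct sockaddr *)&local, &len) == 0) {
        *src = local.sin_addr;
        ret = 0;
    }
    close(fd);
    return ret;
}
```

The resulting address would then be what step L2 binds to.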


Makes sense? If yes, what's the need for the address configuration file?

Or.

Re: QoS in local SA entity

2009-11-08 Thread Or Gerlitz


Jason Gunthorpe wrote:

The entire point of the rdma_getaddrinfo + AF_IB is to avoid hacking up 
librdmacm for every address lookup/cache scheme someone invents
The entire simple point I am trying to make is that rdma_getaddrinfo + 
AF_INET is doable, is simple and is needed to keep up the essence of the 
rdma-cm. I don't see how AF_IB buys anything for anyone, but if you 
want to push it, then as long as AF_INET stays first and the most 
supported/interoperable option, now and in the future, go and add your bits. 
As you indicated, the route lookup I was mentioning could be done in 
rdma_getaddrinfo, sure, with &res including both source and destination 
addresses. No rdma_resolve_addr2 is needed; the one that exists now already 
takes a source address. I don't see what extra info is needed for an 
AF_INET address that was resolved with rdma_getaddrinfo. Is this AF_IB specific?


I don't see why the app should bother calling rdma_getaddrinfo; it 
can be done by librdmacm, with rdma_getaddrinfo having multiple modules 
as you suggested. I am in favor of the approach suggested by Sean, of 
librdmacm either doing its native flow or, under an environment variable, 
doing an alternative flow, where your suggestion not to have the 2nd 
flow tightly coupled with ACM, e.g. through using the get_addrinfo 
abstraction and friends, makes sense (yes!).


Or.


Re: [PATCH RESEND] ib/iser: re-write SG handling for rdma logic




This patch re-writes the logic that does the above, to make it clearer and simpler. It also fixes a 
bug in the being aligned for rdma checks, where a "start" check wasn't done but rather 
only "end" check.
  
Roland, I don't see this patch in your for-next branch, any reason not 
to merge this?


Or.

Re: QoS in local SA entity


Sean Hefty wrote:

[...] The current implementation of ACM converts this to:
** Source sends a multicast request to destination IP
** Destination sends a response with IP to DGID mapping
- Path record is constructed from multicast group information   
ACM needs to know what the local addresses are, so it can respond to requests
for those addresses
okay got it. Still, how do you see my suggestion on the unified/modified 
librdmacm flow (L1/L2/L3 in my email) which would be taken when working 
against a "DGID/Route" provider such as ACM?


Or.

Re: QoS in local SA entity

Jason Gunthorpe wrote:

The extra info in rdma_resolve_addr2 carries the IB specific path information
from the rdma_getaddrinfo module to the kernel for the address pair. The entire
purpose of AF_IB is to let user space tell the kernel it does not want a kernel
side ND and PR query, instead user space will provide all the information.
The kernel patches posted by Sean replace the ND/PR flow with a two-step
process: first specifying a DGID to the kernel, next specifying a
PATH. My suggestion is to have a librdmacm-initiated bind before
sending the DGID to the kernel; this way AF_INET would be supported
perfectly under the slight limitation that the source address port, pkey> tuple would be chosen by route lookup and not by the
neigh->dev that was resolved by the kernel ND. This applies only when the
modified flow of librdmacm is taken (e.g. under user specification with an
environment variable, etc.).

--If-- on top of that you want to add AF_IB, we may be able to do that,
but I don't see why the whole thing should be made for AF_IB only.

Think of it this way, ACM takes over the entire process of what AF_INET does in
the kernel. AF_INET talks directly to the IB CM module in the kernel. Thus, it
also makes sense that ACM would need to talk to IB CM directly as well. AF_IB
is that direct connection.

I don't agree we must state it this way. I see ACM as an alternative way
for AF_INET to resolve ND/PR.

Or.


Re: LID reconfiguration

> One more question;  I saw librdmacm which looked nice but it does not
> support multi-path connections.  It would eliminate a lot of code if we
> could use this

what are your needs?

Or.

Re: LID reconfiguration

Jeff Roberson wrote:
> I would want a way to specify the alternate sockaddr with automatic
> failover between them.  Perhaps with some notification when a failover occured

From your description I still don't see what the alternate address buys you. 

As was suggested here, bond two IPoIB devices, use the address of the bond in 
your librdmacm based app and you get automatic HA. You get indications on 
failover through RDMA_CM_EVENT_ADDR_CHANGE, see rdma_get_cm_event(3).

Or.

Re: Crash in bonding

2009-11-10 Thread Or Gerlitz

Pradeep Satyanarayana wrote:

> The crash is specific to IPoIB, and does not happen with Ethernet slaves.

okay

> Can you explain why you plan to remove this from the newer distros? This is 
> indeed news to me

we plan to remove bonding from --ofed-- as the distro provided bonding supports 
ipoib, simple as that, what isn't clear here?

Or.

Re: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM

2009-11-11 Thread Or Gerlitz


Sean Hefty wrote:

I'll compare my final patches against the ones submitted by David to see if 
anything got missed
  
Are Jason's patches a superset of David's patches? Or do they need to be 
applied first, and only then can David's work be re-reviewed/merged, etc.?


Or.

[PATCH] librdmacm/mckey: add notifications on events

2009-11-11 Thread Or Gerlitz

add notifications on multicast error and address change events which
can take place while traffic is running.

Signed-off-by: Or Gerlitz 

Index: librdmacm/examples/mckey.c
===
--- librdmacm.orig/examples/mckey.c
+++ librdmacm/examples/mckey.c
@@ -62,6 +62,7 @@ struct cmatest_node {

 struct cmatest {
 	struct rdma_event_channel *channel;
+	pthread_t		cmathread;
 	struct cmatest_node	*nodes;
 	int			conn_index;
 	int			connects_left;
@@ -319,6 +320,30 @@ static int cma_handler(struct rdma_cm_id
 	return ret;
 }

+static void *cma_thread(void *arg)
+{
+	struct rdma_cm_event *event;
+	int ret;
+
+	while (1) {
+		ret = rdma_get_cm_event(test.channel, &event);
+		if (ret) {
+			perror("rdma_get_cm_event");
+			exit(ret);
+		}
+		switch (event->event) {
+		case RDMA_CM_EVENT_MULTICAST_ERROR:
+		case RDMA_CM_EVENT_ADDR_CHANGE:
+			printf("mckey: event: %s, status: %d\n",
+			       rdma_event_str(event->event), event->status);
+			break;
+		default:
+			break;
+		}
+		rdma_ack_cm_event(event);
+	}
+}
+
 static void destroy_node(struct cmatest_node *node)
 {
 	if (!node->cma_id)
@@ -475,6 +500,7 @@ static int run(void)
 	if (ret)
 		goto out;

+	pthread_create(&test.cmathread, NULL, cma_thread, NULL);
 	/*
 	 * Pause to give SM chance to configure switches.  We don't want to
 	 * handle reliability issue in this simple test program.

Re: ipath now and then (was [PATCH] IB/core: export struct ib_port)

2009-11-11 Thread Or Gerlitz

On Wed, Nov 11, 2009 at 11:06 PM, Dave Olson  wrote:
> And yes, the ib_ipath is being fully deprecated.  The "full set" of
> patches that adds ib_qib upstream will include a subset that drops
> ib_ipath.   All the bug fixes and feature work have been done for ib_qib

It was brought up on a few occasions that the ipath driver can be
changed such that it becomes a software IBoE driver (e.g. use a packet
socket with the IBoE ether type for the IB L2 emulation).
If it doesn't have to serve the qlogic HCA anymore, this
transformation might be even easier.
I wonder if it's better to remove it now and maybe return it later with
the new facelift, or leave it till the change is done.

Or.

Re: [PATCH] librdmacm/mckey: add notifications on events

2009-11-12 Thread Or Gerlitz

Sean Hefty wrote:
> mckey is intended to be a fairly simple send/receive multicast test program.
> What's the reasoning behind adding the event handling?

The librdmacm examples serve multiple purposes, among them user education 
on how to write rdmacm based apps and as a vehicle to test/validate/reproduce 
features/bugs/issues. For example, a fellow programmer said that she wasn't sure 
her application gets a multicast error event when a port goes down, so 
with my patch to mckey we were able to see that this event is generated and we 
can now do better testing. In the future mckey can be further enhanced to 
rejoin, etc. on either of the events. Makes sense?

Or.

Re: ipath now and then (was [PATCH] IB/core: export struct ib_port)

2009-11-12 Thread Or Gerlitz

Ralph Campbell wrote:
> I don't understand what you are suggesting.
> The kernel module name ib_ipath and/or directory name
> drivers/infiniband/hw/ipath could be reused for some
> other purpose certainly.

On second thought, it's better that you go and remove the hw/ipath directory. I 
assume the qib code could be made to serve software IBoE in the same manner 
ipath can; just make sure to keep the IB L2 handling in separate files from the 
L3/L4 ones...

Or

Re: [PATCHv2] infiniband-diags/ibqueryerrors: Add support for PortXmitDiscardDetails

2009-11-14 Thread Or Gerlitz


Sasha Khapyorsky wrote:

I don't think this is the forum to discuss vendor bugs.


There is no way we can commit a fix here for an undocumented bug.

Or.


Re: [PATCH RESEND] ib/iser: re-write SG handling for rdma logic

2009-11-14 Thread Or Gerlitz

Roland Dreier wrote:
> I just haven't been in a merging mode lately... will start working on my 
> 2.6.33 queue soon

So when, more or less, is this work going to start? It seems there are a bunch of 
things on the plate for this cycle.

Or.


Re: [PATCH 8/9] ib/addr: simplify resolving IPv4 addresses

2009-11-16 Thread Or Gerlitz

Sean Hefty wrote:
> Merge resolve local/remote address resolution into a single
> data flow to ensure consistent access and use of the local routing tables.

Sean, I reviewed patches 1-6 & 8 and they all look fine, I will give the whole 
series a try later this week to further validate them.

> Based on work from:
> David Wilder 
> Jason Gunthorpe 

David, Jason, are you planning to test these patches as well? specifically I 
assume the IPv6 work should be of interest to you...

Or.

Re: [ewg] [PATCHv6 0/10] RDMAoE support

2009-11-18 Thread Or Gerlitz


Eli Cohen wrote:

This new series reflects changes based on feedback from the community on the 
previous set of patches, and is tagged v6. Previous series were posted to the 
openfabrics general list only.

Changes from v5:
1. Bug fixes.
How do you expect a reviewer to learn what were the bugs and what are 
the fixes and if there are bugs that are known and weren't fixed yet? is 
one expected to do a diff between patches? where is the listing of 
changes from vX for X=1,2,3,4?


Or.


Re: [PATCH 8/9] ib/addr: simplify resolving IPv4 addresses

2009-11-19 Thread Or Gerlitz

> I reviewed patches 1-6 & 8 and they all look fine, I will give the whole
> series a try later this week to further validate them

I tested the patch series (V2 for the patches that have it, V1 for the rest)
over 2.6.32-rc5 and librdmacm-1.0.8-1.el5, covering AF_INET/PS_TCP unicast,
AF_INET/PS_IPOIB multicast, and bonding (operability and address-change event).
I used mckey and rping; all worked fine. Thanks for driving this change set,
Sean. David, I'll be happy to hear how the IPv6 testing went; let's get this
going.



Or.


Re: [PATCH 7/9] rdma/cm: fix loopback address support

2009-11-24 Thread Or Gerlitz

Sean Hefty  wrote:

> I will create a new librdmacm package that corresponds with the changes

I made all my testing of the patch set with librdmacm 1.0.10 and
patched 2.6.32-rc5 kernel, where as I wrote you, I was focusing on
AF_INET/PS_TCP and AF_INET/PS_IPOIB.
I understand that Dave was covering AF_INET6/PS_TCP with plenty of the
ipv6 variations.

So what will this new librdmacm package let us cover which wasn't
possible so far? Do you refer to IPv6 support in mckey? Anything else?

Or.

Re: [PATCH 7/9] rdma/cm: fix loopback address support

2009-11-24 Thread Or Gerlitz

> Changes were your changes to mckey, plus changes Dave added to cmatose to
> support IPv6.  The actual library itself hasn't been modified.

Okay, got it. I was under the impression that mckey still lacks an
option for the user to supply an IPv6 multicast address which is neither
all zeros nor unmapped, correct? Or will the -m option work with both
IPv4 and IPv6 addresses?

Or.

Re: RDMAoE verbs questions

2009-11-24 Thread Or Gerlitz


Jeff Squyres wrote:

> I was reviewing Mellanox's Open MPI patches for RDMAoE support

Hi Jeff,

Can you point us to the patch series (a mail thread or some
repository where they sit)?


> 1. It looks like there is a new field on the ibv_port_attr struct:
> transport. Is it expected that all device drivers will start filling
> in this value, or is it done in the OF core code somewhere?

Please note that this field isn't present in the distro-provided IB
stack, and hence it is highly recommended to avoid referring to it in
your code. At least some of us (...) are for decoupling ompi from ofed,
so let's not put sticks in the wheels of that process.


> the Open MPI RDMAOE patch implies that host loopback is not supported
> in RDMAOE mode (but it is in IB mode).  To be clear, the OMPI code had
> to do something different for real IB vs. RDMAOE in at least 1 or 2 places

Liran, where does this limitation come from? Doesn't the HCA support
bridging (loopback connections) for RDMAoE? If the limitation is real,
maybe you should add a device capability to mark that.


Or.


Re: RDMAoE verbs questions

2009-11-25 Thread Or Gerlitz


Jeff Squyres wrote:
> Here's one thread:
> http://www.open-mpi.org/community/lists/devel/2009/11/7063.php

Jeff, looking at the threads you sent, I didn't find a way to
download the patch in a form which can be applied to a source tree. Is
there a way to do it through this archive? Are these patches available
from some git tree @mellanox or elsewhere? Does anyone have the email
address of Vasily Philipov (/vasily_at_[hidden]/)? If yes, can you, or
Pasha, please ask him to send the proposed patch to me or, better, to
this list? Many thanks.


Or


Re: RDMAoE verbs questions

2009-11-25 Thread Or Gerlitz


Pavel Shamis (Pasha) wrote:

> The patch is attached

Thanks. This patch basically replaces checks that the device transport
type is IB with a check that either that holds or the port transport
type is rdmaoe. As Jason and Tziporet noted, the port transport type
seems to be a bad and non-compatible/interoperable idea, so it should,
and probably could, be avoided.


I see another patch @
http://www.open-mpi.org/community/lists/devel/2009/11/7063.php
Can you send that one as well? The patch you sent isn't signed, so I
can't address the author in further replies (unless you are the author).
Also, it wasn't generated with the -p option of diff, which shows the
affected function for each change; doing so would help in the review.


Or.


Re: RDMAoE verbs questions

2009-11-26 Thread Or Gerlitz

Pavel Shamis (Pasha) wrote:
> The only reason for this changes is the fact that for IB devices we
> prefer to use our own open mpi connection managers. In case if we will
> decide to use RDMA-CM for all devices the number of changes will be zero...

Either way, this change is currently still there, and it would be best if
you removed it and found another way to set this predicate.

> So we decided to use the current ompi code as is, in future maybe we will
> implement own ompi rdmacm code that will not have all this work around flows.

Just to make sure I am with you: all in all, only one patch is proposed to
ompi for rdmaoe support, and it is the patch we discuss above. This patch
does three things:

1. changes BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB to look at the port
   transport type
2. if the port transport is rdmaoe, doesn't run loopback connections on IB
3. makes some change in the qp destroy logic
4. that's it...

Correct? Can you comment on #2? Why aren't loopback connections supported?

Or.

Re: Reliable IB connections (RC) and event ordering

2009-12-01 Thread Or Gerlitz

Roland Dreier wrote:
> The IBA takes into account this lack of ordering in multiple places --
> defining "communication established" async events, etc.

The same goes for the IB stack, e.g. take a look at the ib_cm_notify and
rdma_notify APIs.

Or.

Re: RDMAoE verbs questions

2009-12-02 Thread Or Gerlitz


Liran Liss wrote:

> from an rdmacm app's point of view - there is no visible difference between IB
> and RDMAoE ports: both support the complete set of Verbs, just as any IB
> transport provider

Wrong, local (loopback) communication isn't supported with RDMAoE.

Or.

Re: RDMAoE verbs questions

2009-12-02 Thread Or Gerlitz


Paul Grun wrote:
> Why do you say that Or?

I said that because the latest patch set posted by Mellanox doesn't support
loopback. I now hear that this was a temporary limitation which will be
removed, so let it be.


Or.

Re: QoS settings not mapped correctly per pkey ?

2009-12-03 Thread Or Gerlitz

Yevgeny Kliteynik wrote:
> " It looks like in 'datagram' mode, the SL weights
>   do not seem to be applied, or maybe this is an
>   artifact of IPoIB in 'datagram mode' "

Yes, there's no reason for connected mode to behave differently with respect
to QoS/SL assignment from the SM, as both modes get their SL from the path
record provided by the SM and both modes use the same code for the path query...

> Have you checked that in this mode you do get the right
> SL for each child interface by shutting off the relevant
> SL (mapping it to VL15)?

Seeing which SL the SM provides in response to the path query is trivial,
either through the opensm logs or the ipoib ones. E.g. here you see that ib1
got SL 0 on its path to GID fe80:0000:0000:0000:0008:f104:0399:3c92,
LID 0x0006, which is 10.10.0.91:

> ifdown ib1
> echo 1 > /sys/module/ib_ipoib/parameters/debug_level
> ifup ib1
> ping 10.10.0.91
> dmesg | grep ib1

> ib1: Start path record lookup for fe80:0000:0000:0000:0008:f104:0399:3c92 MTU > 0
> ib1: PathRec LID 0x0006 for GID fe80:0000:0000:0000:0008:f104:0399:3c92
> ib1: Created ah 81021ddda180
> ib1: created address handle 81021ddda500 for LID 0x0006, SL 0

> # ip neigh show dev ib1
> 10.10.0.91 lladdr 80:00:00:49:fe:80:00:00:00:00:00:00:00:08:f1:04:03:99:3c:92 
> REACHABLE
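
The neighbour entry above can be cross-checked against the PathRec lines. An IPoIB link-layer address is 20 bytes: one flags byte, a 24-bit QPN, then the 16-byte port GID. The following is a minimal sketch in C that splits such an address, assuming that layout (as described in RFC 4391); the struct and function names are hypothetical and not part of any IB library.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type: the decoded parts of a 20-byte IPoIB lladdr. */
struct ipoib_lladdr {
    uint8_t  flags;    /* first byte */
    uint32_t qpn;      /* next three bytes: 24-bit queue pair number */
    uint8_t  gid[16];  /* remaining 16 bytes: the port GID */
};

/* Parse 20 colon-separated hex bytes ("80:00:...:92"). Returns 0 on success. */
static int parse_ipoib_lladdr(const char *s, struct ipoib_lladdr *out)
{
    uint8_t raw[20];

    for (int i = 0; i < 20; i++) {
        unsigned int byte;
        int used = 0;

        /* Each byte must be exactly two hex digits. */
        if (sscanf(s, "%2x%n", &byte, &used) != 1 || used != 2)
            return -1;
        raw[i] = (uint8_t)byte;
        s += used;
        if (i < 19) {
            if (*s != ':')
                return -1;
            s++;
        }
    }
    if (*s != '\0')
        return -1;

    out->flags = raw[0];
    out->qpn = ((uint32_t)raw[1] << 16) | ((uint32_t)raw[2] << 8) | raw[3];
    memcpy(out->gid, raw + 4, 16);
    return 0;
}

/* Format a GID as eight 16-bit groups; buf should hold at least 40 bytes. */
static void format_gid(const uint8_t gid[16], char *buf, size_t len)
{
    size_t off = 0;

    for (int i = 0; i < 16 && off < len; i += 2) {
        int n = snprintf(buf + off, len - off, "%s%02x%02x",
                         i ? ":" : "", gid[i], gid[i + 1]);
        if (n < 0)
            break;
        off += (size_t)n;
    }
}
```

Feeding it the lladdr from the `ip neigh` output should give a QPN of 0x49 and a GID whose low half is 0008:f104:0399:3c92, matching the GID the path record lookup was issued for.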


Re: InfiniBand/RDMA merge plans

2009-12-08 Thread Or Gerlitz

Roland Dreier wrote:
> Since 2.6.31-rc8 has been out more than a week already, it's probably
> a good time to talk about 2.6.32 merge plans.  All the pending things
> that I'm aware of are listed below.

Hi Roland, any update on the 2.6.33 merge plans?

Or.

Re: [PATCH 06/11] RDMA/nes: abnormal listener termination causes loopback node crash

2009-12-09 Thread Or Gerlitz


Faisal Latif wrote:

> when listener is destroyed for loopback connection

Does the upstream iwarp stack support loopback connections? Does it
apply to all vendors?


Or.

Re: InfiniBand/RDMA merge plans for 2.6.33

2009-12-15 Thread Or Gerlitz


Eli Cohen wrote:

>  - IBoE.  In principle I think this is starting to get there.  Still
>    want to see better ABI compatibility at least, and also make sure
>    the interface chosen works for both rdmacm and non-rdmacm applications.
>
> Based on this, I am going to send a new patch set, a few days after 2.6.33-rc1 is out

Eli, here are some more issues which should be on the table and which you
might want to look at before posting a new version of the patches (or,
if you prefer to handle them down the road of the review process,
that's fine):


- loopback support: Liran commented that this works. Does this mean
only a firmware fix is needed?


- below-the-cover-addr-resolve-in-create-AH flow races, e.g.
https://bugs.openfabrics.org/show_bug.cgi?id=1866


- L2 Ethernet integration for rdma-cm based apps, namely at minimum have
the  gang to comply with packets sent by the network stack for the same
IP route.


Or.

Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)


Liran Liss wrote:
>> all the rdmaoe materials say the lossless traffic class is a must,
>> are you saying that this works well also without it? then why, from an
>> architecture point of view, have you posed this requirement?

> lossless traffic can be achieved today using global pause, for
> example.  PFC is still important; we will submit initial patches that
> support it next week

Liran, I would say that on the one hand global pause isn't the way to go,
and on the other hand IB RC functions quite badly when many packets are
lost. As such, RDMAoE without PFC and mapping of priorities into TCs (the
Ethernet VLs) isn't really for production in any non-trivial environment
involving more than one hop. Also, this email is from one month ago; any
news on the patches?


Yevgeny, I took a look, and there are patches to support pfc for the
mlx4_en driver, but they were never submitted upstream, which means that
even if rdmaoe goes upstream, mainline users will not even be able to
really test it. Also, the pfc configuration in these patches seems to
be done with sysfs and not through the Netlink APIs defined in
include/net/dcbnl.h. Did you have any specific reason not to integrate
with the mainline method of pfc/tc configuration?


Or.

[PATCH] IB/mlx4: fix post_recv wq overflow check

The post recv flow should check for wq overflow using the recv cq, not the send cq.

Signed-off-by: Or Gerlitz 

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 989555c..2a97c96 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1752,7 +1752,7 @@ int mlx4_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
ind = qp->rq.head & (qp->rq.wqe_cnt - 1);

for (nreq = 0; wr; ++nreq, wr = wr->next) {
-   if (mlx4_wq_overflow(&qp->rq, nreq, qp->ibqp.send_cq)) {
+   if (mlx4_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) {
err = -ENOMEM;
*bad_wr = wr;
goto out;
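
For readers unfamiliar with the check being patched, here is a simplified userspace model of a work-queue overflow test, written as a sketch rather than as the driver's code. The struct and function names are hypothetical. The assumption is that the queue keeps free-running head and tail counters, with the tail advanced by whoever polls the associated CQ; that dependency on the polling side is why it matters which CQ the check is given, and for a receive queue that is the receive CQ. Any locking the kernel check performs is omitted here.

```c
#include <stdint.h>

/* Simplified stand-in for a work queue's producer/consumer counters. */
struct wq_model {
    uint32_t head;      /* advanced by the poster as WQEs are queued */
    uint32_t tail;      /* advanced as completions are polled from the CQ */
    uint32_t max_post;  /* maximum number of outstanding WQEs */
};

/*
 * Returns nonzero if accepting one more request would overflow the queue,
 * given that nreq requests were already accepted in the current post call.
 */
static int wq_would_overflow(const struct wq_model *wq, uint32_t nreq)
{
    /* Unsigned subtraction stays correct when the counters wrap. */
    uint32_t cur = (uint32_t)(wq->head - wq->tail);

    return cur + nreq >= wq->max_post;
}
```

With head = 10, tail = 4 and max_post = 8, six entries are outstanding, so one more request fits and a second does not.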

Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)

Roland Dreier  wrote:

> I agree that implementing DCB is important for IBoE, but why do you say
> that a classical ethernet fabric with global pause isn't usable?  That
> should be roughly equivalent to an IB fabric that uses only a single VL,
> which is the case for many production IB fabrics.

To start with, no matter how many data VLs are used (e.g. one), all the
crucial management traffic (SMPs) goes on VL15, which is on the one hand
lossy and on the other hand not subject to congestion when other VLs
are. Now how would you manage your Cisco switch --remotely-- on a
globally paused fabric when some multicast receiver hasn't had its
breakfast and now slows the sender while filling the queues throughout
the congestion tree which this switch is part of?

To continue with, lossless is good, but to make your cluster usable
under congestion you need congestion control, that is QCN, which is
designed/optimized for the case of multiple TCs.

Also, IBoE can potentially find its way to much more complex
environments than IB has, specifically to clusters whose hosts are
acting as hypervisors running many, many VMs and whose underlying fabric
consolidates many types of traffic. Globally pausing a port can
dramatically reduce the efficiency of such a computing center, which
was probably built originally to increase efficiency.

I believe that the ixgbe team understands that well, and hence their
continued DCB efforts can make the combination of RXE with
Niantic/ixgbe very interesting to test.

Or.

Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)

Paul Grun  wrote:
> there doesn't appear to be an argument in favor of requiring DCB with RoCEE

Interesting. The ofa server is down now, so I don't have access to the ofa
IBoE materials, but from memory I recall that in ALL of them you have
made the IBoE/CEE bundling very clear and evident, e.g. this IBTA
presentation made to T11 @
http://www.t11.org/ftp/t11/pub/fc/study/09-543v0.pdf

Or.

Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)

2009-12-24 Thread Or Gerlitz


Roland Dreier wrote:

> Sure, DCB is very useful, in many environments. And maybe even a requirement
> sometimes.  I'm simply trying to say that IBoE with classical ethernet is at
> least as useful as standard IB in many cases

Roland, Paul,

Putting aside for a moment the detailed discussion we've started and
looking at the concluding remarks you have made, I'm not sure I
follow: if DCB isn't available (even for a silly reason such as hw
supporting pfc but the patches not being pushed to the kernel...), what
do you think would function better (or function at all) for IBoE, lossy
or globally paused Ethernet? I haven't managed so far to convince you
that neither is applicable for IBoE, but I also haven't managed to see
what you are suggesting in the absence of DCB.


Or.



Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)

2009-12-24 Thread Or Gerlitz


Liran Liss wrote:

> I second...

Fair enough, so now that (A) everyone agrees that DCB is good for IBoE and
(B) mlx4 supports pfc, is there any reason not to push the pfc patches into
the kernel and have mlx4_en comply with the mainline dcbnl code?

> The only way an end-node can cause congestion is if its internal buses don't
> match the IB link's BW, but this is unrelated to (lack of) transport-level flow
> control.

Thanks for clarifying this.

Or.

Re: [PATCH] IB/mlx4: fix post_recv wq overflow check

2010-01-06 Thread Or Gerlitz

Roland Dreier wrote:
> thanks, applied.

With this not being a regression, I see that it went into your for-next branch 
and as such I assume will be available by 2.6.34. Are you fine with the patch 
going into the -stable series? 

Or.

Re: [PATCH] IB/mlx4: fix post_recv wq overflow check

2010-01-07 Thread Or Gerlitz


Roland Dreier wrote:

> Actually I was planning on sending it for 2.6.33, since it's so small and
> obvious and we're reasonable early in the cycle.  Not sure about -stable though
> -- has this been hit in practice?

I agree that it should go into 2.6.33; since it's so small there's no
reason to wait for 2.6.34. As for the being-hit question: note that
without the fix there is both a bug in the overflow check and extra
contention between the post recv and poll send cq flows, for ULPs whose
send cq differs from their recv cq, e.g. IPoIB. I came across
this bug while reviewing the mlx4 posting code during some profiling.


I wonder if the overflow check could be removed altogether and left
to the ULP (the kernel is a trusted environment...). Is there any risk in
doing so? This way the WR posting code would not contend with the
poll WC code on the CQ lock.


Or.

Re: [PATCHv7 4/9] ib_core: RoCEE CMA device binding

2010-01-07 Thread Or Gerlitz

Eli Cohen wrote:
> +static int cma_resolve_rocee_route(struct rdma_id_private *id_priv)
[...]
> + route->path_rec->hop_limit = 2;

Why? Does this value have any specific meaning?

> + route->path_rec->mtu_selector = 2;

All the xxx_selector usages in this code should be
changed to use the ib_sa.h selector enum.

Or.
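
To illustrate the selector comment, here is a small sketch of what replacing the bare `2` could look like. The enum below is a local mirror of the selector values as recalled from include/rdma/ib_sa.h (where they are named IB_SA_GT, IB_SA_LT, IB_SA_EQ and IB_SA_BEST); treat the values as an assumption and verify them against your tree. The struct and helper are hypothetical, cut-down stand-ins, not the kernel's path record.

```c
#include <stdint.h>

/* Local mirror of the SA selector values; an assumption, see the note above. */
enum sa_selector {
    SA_GT   = 0,  /* greater than */
    SA_LT   = 1,  /* less than */
    SA_EQ   = 2,  /* exactly equal */
    SA_BEST = 3   /* best available */
};

/* Hypothetical cut-down path record with only the fields used here. */
struct path_rec_model {
    uint8_t mtu_selector;
    uint8_t mtu;
    uint8_t rate_selector;
    uint8_t rate;
};

/* Request exactly the given MTU code, using the named selector, not a bare 2. */
static void set_exact_mtu(struct path_rec_model *rec, uint8_t mtu_code)
{
    rec->mtu_selector = SA_EQ;
    rec->mtu = mtu_code;
}
```

The stored value is the same as before; the named constant only makes the intent ("exactly this MTU") visible at the assignment.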

[PATCH] ib/ipoib: remove TX moderation from the ethtool related code

2010-01-11 Thread Or Gerlitz

As of commit f56bcd8 "IPoIB: Use separate CQ for UD send completions",
there are no TX interrupts in the main code path. Change the ethtool
related code to comply with this, so that users are not misled into
assuming they can control TX interrupt moderation. Pointed out by
Alex Vainman 

Signed-off-by: Or Gerlitz 

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index e9795f6..d10b4ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -55,9 +55,7 @@ static int ipoib_get_coalesce(struct net_device *dev,
struct ipoib_dev_priv *priv = netdev_priv(dev);

coal->rx_coalesce_usecs = priv->ethtool.coalesce_usecs;
-   coal->tx_coalesce_usecs = priv->ethtool.coalesce_usecs;
coal->rx_max_coalesced_frames = priv->ethtool.max_coalesced_frames;
-   coal->tx_max_coalesced_frames = priv->ethtool.max_coalesced_frames;

return 0;
 }
@@ -69,10 +67,8 @@ static int ipoib_set_coalesce(struct net_device *dev,
int ret;

/*
-* Since IPoIB uses a single CQ for both rx and tx, we assume
-* that rx params dictate the configuration.  These values are
-* saved in the private data and returned when ipoib_get_coalesce()
-* is called.
+* These values are saved in the private data and returned
+* when ipoib_get_coalesce() is called
 */
if (coal->rx_coalesce_usecs   > 0x ||
coal->rx_max_coalesced_frames > 0x)
@@ -85,8 +81,6 @@ static int ipoib_set_coalesce(struct net_device *dev,
return ret;
}

-   coal->tx_coalesce_usecs   = coal->rx_coalesce_usecs;
-   coal->tx_max_coalesced_frames = coal->rx_max_coalesced_frames;
priv->ethtool.coalesce_usecs   = coal->rx_coalesce_usecs;
priv->ethtool.max_coalesced_frames = coal->rx_max_coalesced_frames;


upstream mlx4/ib/4K mtu support

2010-01-11 Thread Or Gerlitz

Hi Vlad, I came across this ofed patch which isn't upstream. Is it a must
for making mlx4/ib/4K mtu work? Was it rejected upstream? If so, why?

Or.


mlx4/IB: Add set_4k_mtu module parameter.

It controls the InfiniBand link MTU for all IB ports in a host.

Signed-off-by: Vladimir Sokolovsky 
---
Index: ofed_kernel-fixes/drivers/net/mlx4/port.c
===
--- ofed_kernel-fixes.orig/drivers/net/mlx4/port.c  2009-11-09 02:20:06.0 +0200
+++ ofed_kernel-fixes/drivers/net/mlx4/port.c   2009-11-09 02:21:46.0 +0200
@@ -37,6 +37,10 @@

 #include "mlx4.h"

+int mlx4_ib_set_4k_mtu = 0;
+module_param_named(set_4k_mtu, mlx4_ib_set_4k_mtu, int, 0444);
+MODULE_PARM_DESC(set_4k_mtu, "attempt to set 4K MTU to all ConnectX ports");
+
 #define MLX4_MAC_VALID (1ull << 63)
 #define MLX4_MAC_MASK  0xULL

@@ -308,6 +312,9 @@

memset(mailbox->buf, 0, 256);

+   if (mlx4_ib_set_4k_mtu)
+       ((__be32 *) mailbox->buf)[0] |= cpu_to_be32((1 << 22) | (1 << 21) | (5 << 12) | (2 << 4));
+
((__be32 *) mailbox->buf)[1] = dev->caps.ib_port_def_cap[port];
err = mlx4_cmd(dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT,
   MLX4_CMD_TIME_CLASS_B);
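
As a worked example of the word the patch writes into the mailbox, the sketch below composes the same bit pattern and stores it big-endian, which is the effect of cpu_to_be32() on any host. Only the value 5 is given a name here, since 5 is the standard IB MTU code for 4096 bytes; the meaning of the other bits is not stated in the patch and is deliberately left unnamed. The helper names are hypothetical.

```c
#include <stdint.h>

/* The IB MTU code for 4096 bytes (256 = 1, 512 = 2, ..., 4096 = 5). */
#define IB_MTU_CODE_4096 5u

/* Compose the 32-bit word from the patch; other fields are left unnamed. */
static uint32_t set_port_word(void)
{
    return (1u << 22) | (1u << 21) | (IB_MTU_CODE_4096 << 12) | (2u << 4);
}

/* Store a 32-bit value in big-endian byte order, independent of the host. */
static void put_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}
```

The composed word is 0x00605020, so the first four mailbox bytes become 00 60 50 20.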

Re: RDMA Read sge errors

2010-01-11 Thread Or Gerlitz

Jack, I see now that commit cd155c1 "IB/mlx4: Fix creation of kernel QP with
max number of send s/g entries" is in mainline but not in ofed 1.4.x, and that
mlx4_0090_fix_sq_wrs.patch (below) is in ofed but not in mainline. Was it
rejected from the mainline kernel? If so, why?

Or.


1. Limit qp resources accepted for ib_create_qp() to the limits reported
   in ib_query_device(). In kernel space, make sure that the limits
   returned to the caller following qp creation also lie within the
   reported device limits. For userspace, report as before, and
   do adjustment in libmlx4 (so as not to break ABI).

2. Limit max number of wqes per QP reported when querying the device,
   so that ib_create_qp will never fail due to any additional headroom
   WQEs allocated.

Signed-off-by: Jack Morgenstein 

---
 drivers/infiniband/hw/mlx4/main.c|2 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |7 +++
 drivers/infiniband/hw/mlx4/qp.c  |   25 +++--
 3 files changed, 27 insertions(+), 7 deletions(-)

Index: ofed_kernel/drivers/infiniband/hw/mlx4/main.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/main.c
+++ ofed_kernel/drivers/infiniband/hw/mlx4/main.c
@@ -122,7 +122,7 @@ static int mlx4_ib_query_device(struct i
props->max_mr_size = ~0ull;
props->page_size_cap   = dev->dev->caps.page_size_cap;
props->max_qp  = dev->dev->caps.num_qps - dev->dev->caps.reserved_qps;
-   props->max_qp_wr   = dev->dev->caps.max_wqes;
+   props->max_qp_wr   = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE;
props->max_sge = min(dev->dev->caps.max_sq_sg,
 dev->dev->caps.max_rq_sg);
props->max_cq  = dev->dev->caps.num_cqs - dev->dev->caps.reserved_cqs;
Index: ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -44,6 +44,13 @@
 #include 
 #include 
 
+enum {
+   MLX4_IB_SQ_MIN_WQE_SHIFT = 6
+};
+
+#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
+#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))
+
 struct mlx4_ib_ucontext {
struct ib_ucontext  ibucontext;
struct mlx4_uar uar;
Index: ofed_kernel/drivers/infiniband/hw/mlx4/qp.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/qp.c
+++ ofed_kernel/drivers/infiniband/hw/mlx4/qp.c
@@ -289,8 +289,9 @@ static int set_rq_size(struct mlx4_ib_de
   int is_user, int has_srq, struct mlx4_ib_qp *qp)
 {
/* Sanity check RQ size before proceeding */
-   if (cap->max_recv_wr  > dev->dev->caps.max_wqes  ||
-   cap->max_recv_sge > dev->dev->caps.max_rq_sg)
+   if (cap->max_recv_wr > dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE ||
+   cap->max_recv_sge >
+   min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg))
return -EINVAL;
 
if (has_srq) {
@@ -309,8 +310,19 @@ static int set_rq_size(struct mlx4_ib_de
qp->rq.wqe_shift = ilog2(qp->rq.max_gs * sizeof (struct mlx4_wqe_data_seg));
}
 
-   cap->max_recv_wr  = qp->rq.max_post = qp->rq.wqe_cnt;
-   cap->max_recv_sge = qp->rq.max_gs;
+   /* leave userspace return values as they were, so as not to break ABI */
+   if (is_user) {
+   cap->max_recv_wr  = qp->rq.max_post = qp->rq.wqe_cnt;
+   cap->max_recv_sge = qp->rq.max_gs;
+   } else {
+   cap->max_recv_wr  = qp->rq.max_post =
+   min(dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE, qp->rq.wqe_cnt);
+   cap->max_recv_sge = min(qp->rq.max_gs,
+   min(dev->dev->caps.max_sq_sg,
+   dev->dev->caps.max_rq_sg));
+   }
+   /* We don't support inline sends for kernel QPs (yet) */
+
 
return 0;
 }
@@ -321,8 +333,9 @@ static int set_kernel_sq_size(struct mlx
int s;
 
/* Sanity check SQ size before proceeding */
-   if (cap->max_send_wr > dev->dev->caps.max_wqes  ||
-   cap->max_send_sge> dev->dev->caps.max_sq_sg ||
+   if (cap->max_send_wr > (dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE) ||
+   cap->max_send_sge>
+   min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg) ||
cap->max_inline_data + send_wqe_overhead(type, qp->flags) +
sizeof (struct mlx4_wqe_inline_seg) > dev->dev->caps.max_sq_desc_sz)
return -EINVAL;
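
The headroom arithmetic in this patch can be worked through in a few lines. The sketch below copies the two macros from the diff (with shortened, hypothetical names) and shows the effect on the reported limit; the hardware figure of 16384 WQEs used in the note that follows is an illustrative assumption, not a value from the patch.

```c
#include <stdint.h>

/* Same arithmetic as the macros added to mlx4_ib.h in the diff above. */
#define SQ_MIN_WQE_SHIFT 6
#define SQ_HEADROOM(shift) ((2048u >> (shift)) + 1u)
#define SQ_MAX_SPARE SQ_HEADROOM(SQ_MIN_WQE_SHIFT)

/*
 * Limit advertised to callers: hardware maximum minus the worst-case
 * headroom. Assumes hw_max_wqes is larger than SQ_MAX_SPARE.
 */
static uint32_t reported_max_qp_wr(uint32_t hw_max_wqes)
{
    return hw_max_wqes - SQ_MAX_SPARE;
}

/* Mirrors the sanity check: a request above the reported limit is rejected. */
static int send_wr_request_ok(uint32_t requested, uint32_t hw_max_wqes)
{
    return requested <= reported_max_qp_wr(hw_max_wqes);
}
```

With a shift of 6 the spare is (2048 >> 6) + 1 = 33 WQEs, so a device with 16384 hardware WQEs would report 16351, and a request for 16352 would be rejected up front instead of failing later when the headroom is added.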

Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space

2010-01-14 Thread Or Gerlitz

sebastien dugue wrote:
> That can be done with port numbers, except that we cannot separate
> traffic to Lustre MDS and traffic to Lustre OSS 

Looking at these patches and going with you for a minute, I don't see how this
patch set helps you assign a different QoS level (e.g. SL) to MDS vs. OSS
related traffic. Can you elaborate on that a bit?

Sean Hefty wrote:
> Can't this be done using port numbers in the existing port space?

Indeed. Sebastien, what prevents you from using the TCP port space, with one
port used for MDS traffic and another port for OSS traffic? How does Lustre
get ports to listen on: are they well known, or do you call bind with port
zero and use the port allocated by the rdma-cm?

Or.

Re: [PATCHv7 7/9] ib_core: Add API to support RoCEE from userspace

2010-01-17 Thread Or Gerlitz

Eli Cohen wrote:
> Add ib_uverbs_get_mac() to be used by ibv_create_ah() to retrieve the remote
> port's MAC address from the remote port's GID. Port link layer is also
> returned by ibv_query_port()

Why can't all this be implemented within libibverbs? Looking at mlx4's
implementation of ib_get_mac, it reduces to calling rdma_get_ll_mac, a
two-line inline function which does the translation.

Or.
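
To make the "two-liner" point concrete, here is a sketch of how such a GID-to-MAC translation could be done entirely in userspace. It assumes the GID's interface ID (the last 8 bytes) is a modified EUI-64 built from the 48-bit MAC, which is how IPv6 link-local addresses are formed; whether the patch's rdma_get_ll_mac does exactly this is not shown in this thread, so treat the mapping as an assumption. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/*
 * Recover a 48-bit MAC from a link-local style GID.
 * Assumption: gid[8..15] is a modified EUI-64 derived from the MAC.
 */
static void gid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
    memcpy(mac, &gid[8], 3);       /* first three MAC bytes (OUI) */
    memcpy(mac + 3, &gid[13], 3);  /* last three, skipping the ff:fe filler */
    mac[0] ^= 0x02;                /* flip the universal/local bit back */
}
```

For a GID of fe80::0202:c9ff:fe12:3456 this yields the MAC 00:02:c9:12:34:56, with no kernel call involved.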

Re: [PATCHv7 4/9] ib_core: RoCEE CMA device binding

2010-01-18 Thread Or Gerlitz

Eli Cohen wrote:
> The other place is IPoIB:path_rec_completion() where we need not require
> GRH since IPoIB over RoCEE is disable

Please note that you can't assume that IPoIB need not use GRH, as at some
future point this code may operate across IB subnets. For a couple of years
now, patches that allow supporting that have been merged into the code, e.g.
see 46f1b3d7 "IB/ipoib: Use ib_init_ah_from_path to initialize ah_attr"

Or.

clarification on the mlx4 CQE structure

2010-01-19 Thread Or Gerlitz

Hi Yevgeny, looking at commit f780a9f "mlx4_core: Add ethernet fields
to CQE struct" I see the following two changes:

@@ -692,14 +692,13 @@ repoll:
-   wc->sl = cqe->sl >> 4;
+   wc->sl = be16_to_cpu(cqe->sl_vid >> 12);

I wasn't sure if/why a conversion from network order to host order is
needed here. Can you clarify that?

Or.


@@ -39,17 +39,18 @@
 struct mlx4_cqe {
-   __be32  my_qpn;
+   __be32  vlan_my_qpn;
-   u8  sl;
-   u8  reserved1;
+   __be16  sl_vid;



Re: clarification on the mlx4 CQE structure

2010-01-19 Thread Or Gerlitz

Yevgeny Petrilin wrote:
> This commit has an endianness bug, that was fixed in commit f781a22f.
> The cqe->sl_vid field is a be16, so we needed to convert the sl value to
> host order. Before the commit this field was two u8 fields, so no conversion
> was needed

okay, got it, thanks 

Or.
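
The endianness issue discussed here comes down to the order of the shift and the byte swap. The sketch below models it portably: it emulates what a little-endian CPU sees when it loads a big-endian 16-bit field, then extracts the top four bits in both orders. It assumes the SL lives in the top 4 bits of sl_vid (as the `>> 12` suggests) and that the result is stored in an 8-bit field; all function names are hypothetical, and this is not the text of the fixing commit.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* What a little-endian CPU reads from a big-endian 16-bit field, unconverted. */
static uint16_t raw_load_le(const uint8_t wire[2])
{
    return (uint16_t)(wire[0] | (wire[1] << 8));
}

/* Shift first, convert after: the order used in the snippet quoted earlier. */
static uint8_t sl_shift_then_swap(const uint8_t wire[2])
{
    uint16_t raw = raw_load_le(wire);

    return (uint8_t)swap16((uint16_t)(raw >> 12));
}

/* Convert to host order first, then shift out the top four bits. */
static uint8_t sl_swap_then_shift(const uint8_t wire[2])
{
    uint16_t raw = raw_load_le(wire);

    return (uint8_t)(swap16(raw) >> 12);
}
```

For wire bytes 0x51 0x23 (SL 5, low 12 bits 0x123), converting first gives 5, while shifting first operates on the wrong nibble and, after truncation to 8 bits, gives 0.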

Re: [PATCH] IB/mlx4: fix post_recv wq overflow check

2010-01-19 Thread Or Gerlitz


Roland Dreier wrote:

> I do think it is quite common to see this WQ overflow check trigger, even for
> kernel code

Hmm, why is that common? Typically there's a higher layer to which the
IB ULP advertises some sort of maximal number of credits (e.g. in the
SCSI case, iser and srp specify the maximal number of commands in the
scsi host template), or the ULP informs a higher layer that no more sends
can be done (e.g. IPoIB calling netif_stop_queue once it senses that the
QP has filled, etc.).


Or.

Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space


sebastien dugue wrote:
>> So I guess you need to change the ports used within the new port space -- but then
>> why can't you just stay in the TCP space but change the ports used?

> No, with the new port space, there's no need to change ports. You only need to
> specify the target GUIDs. For example:
> lustre, target-portguid 0x1234,0x1235 : 1 # lustre traffic to MDSs
> lustre: 2 # default lustre traffic (to OSSs)
> Hope this helps clarify things a bit.

Sorry, but it doesn't. As far as I understand, there are three
possibilities for what the string "lustre" is translated to
by the opensm QoS logic:

(A) lustre port in the TCP port space
(B) lustre port space
(C) nothing (that is, not a service, in the same manner that ipoib just
doesn't mean anything to opensm)


Assuming C is not the case, either A or B will yield the same
result, and as such the new port space buys you nothing.


Or.

Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space

sebastien dugue wrote:
>  No, because in OpenSM's QoS logic, there's no way to map the TCP port
> space with specific target GUIDs onto an SL. You have keywords for SDP, SRP,
> RDS, ISER, ... but not for the TCP port space (or am I missing something?).

Going with this, what prevents you from patching the opensm QoS engine to support
the lustre service under the TCP port space and/or to support a combination of
service and target port-guid? All in all, first, I don't see what a kernel patch
buys you, and second, if it buys you something, you should be able to gain the
same effect by patching opensm.

Thinking about this a bit more: since the rules are processed in order, wouldn't
the following scheme let you achieve the same effect?

target-portguid 0x1234,0x1235 : 1 # traffic to MDSs
lustre: 2 # default lustre traffic (to OSSs)
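
A minimal sketch of the first-match idea, in Python. It only models "rules are processed in order, first match wins"; it is not the opensm QoS parser, and the Rule fields and the match_sl helper are made up for illustration. It assumes that a rule with no service keyword matches any service and that a rule with no GUID list matches any target.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    sl: int
    service: Optional[str] = None                  # None: any service
    target_guids: Optional[FrozenSet[int]] = None  # None: any target


def match_sl(rules, service, target_guid):
    """Return the SL of the first rule that matches, or None."""
    for rule in rules:
        if rule.service is not None and rule.service != service:
            continue
        if rule.target_guids is not None and target_guid not in rule.target_guids:
            continue
        return rule.sl
    return None


# The two-rule scheme: traffic to the MDS port GUIDs first,
# default lustre traffic second.
RULES = [
    Rule(sl=1, target_guids=frozenset({0x1234, 0x1235})),
    Rule(sl=2, service="lustre"),
]
```

With these rules, match_sl(RULES, "lustre", 0x1234) picks SL 1 and lustre traffic to any other target picks SL 2. Note that in this model the first rule also catches non-lustre traffic to those GUIDs, which is the price of dropping the service keyword from it.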

Or.

Re: [PATCH] IB/mlx4: fix post_recv wq overflow check


Roland Dreier wrote:

> In other words this check catches common bugs and makes them a gazillion times
> easier to find and fix.  So unless the performance impact is extreme, I'm
> inclined to leave it

Okay, let's leave it as is unless someone comes up with
performance data that shows this is really a bottleneck.


Or.

Re: [PATCH] ib/ipoib: remove TX moderation from the ethtool related code


Or Gerlitz wrote:

> As of commit f56bcd8 "IPoIB: Use separate CQ for UD send completions",
> there are no TX interrupts in the main code path. Change the ethtool
> related code to comply with this, so that users will not be misled
> into assuming they can control TX interrupt moderation.

Hi Roland, did you have a chance to look at this one?

Or.

Re: rdma_bind failure over iWarp