Re: Merge process for OFED patches

2009-09-16 Thread Or Gerlitz

Bart Van Assche wrote:

> I would like to contact the author of the fourth patch. But unfortunately I
> could not find any author information in that patch.

yes, unsigned and unreviewed patches are a common practice in ofed.
Does this create legal issues? Maybe that would be the way to stop this?


Or.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ipv6 support in rping

2009-09-21 Thread Or Gerlitz
Pradeep Satyanarayana wrote:

> Who will be able to help us with this? Need to include the correct level of 
> librdmacm

Sean, could you do a 1.0.9 release of librdmacm such that the ipv6 support 
could be distributed? 

Or.
 


Re: How to handle illegal multicast addresses in IPoIB?

2009-09-21 Thread Or Gerlitz
Moni Shoua wrote:
> One patch 
> (http://lists.openfabrics.org/pipermail/general/2009-August/061663.html) 
> checks each 
> multicast address for validity before it lets it get into the queue. 

Isn't it the commit below, which appears in Linus' tree?

Or.
> commit 5e47596bee12597824a3b5b21e20f80b61e58a35
> Author: Jason Gunthorpe 
> Date:   Sat Sep 5 20:23:40 2009 -0700
> 
> IPoIB: Check multicast address format
> 
> Check that the format of multicast link addresses is correct before
> taking them from dev->mc_list to priv->multicast_list.  This way we
> never try to send a bogus address to the SA, which prevents badness
> from erronous 'ip maddr addr add', broken bonding drivers, etc.
> 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Roland Dreier 




Re: Possible process deadlock in RMPP flow

2009-09-23 Thread Or Gerlitz
Eli Cohen wrote:
> On Wed, Sep 23, 2009 at 09:08:28AM -0700, Sean Hefty wrote:
>> What kernel does 1.4.2 map to?
> I think OFED 1.4.2 is based on kernel 2.6.27 but they're using RHEL 5.3

Yes, the usual mess: ofed X is based on kernel Y1 but with some additions from
kernel Y2 plus plenty of unreviewed and non-merged patches. Distro Z picks ofed
X and the result is 99% unsupportable, as Roland said. Somehow this ofed
creature is still hanging around, working on the next damage it's going to
bring into this world (code name 1.5).

Eli, here's a little tip for you: I had the displeasure of resolving a bunch of
support cases originating from the fact that the two-year-old commit below
missed some ofed version (sorry, I forgot the number...). Maybe it would help
you as well?

Under a normal setting, if this commit actually solved a bug being hit by many
customers, someone would have opened a distro bugzilla case saying "please
pick this commit for your kernel", and the customers would have either waited
for the next distro update or used a distro intermediate kernel. Currently, I
understand that distros are picking ofed versions and that's it.

Or.

commit b61d92d8ae6aa13b17d1c31e69d123879cec2ee2
Author: Sean Hefty 
Date:   Fri Nov 30 17:30:18 2007 -0800

IB/mad: Fix incorrect access to items on local_list




Re: [ewg] [OPENSM] update functions to match .h prototypes

2009-09-29 Thread Or Gerlitz
Stan C. Smith wrote:
> Hello,
>   The following patches address inconsistencies between header file function 
> prototypes and .c function definitions;
> missing 'const' attribute.
> Attached is a Linux EOL patch file in case a mailer hacks/reformats the text.
> 
> Signed-off-by: Stan Smith (stan.sm...@intel.com)

Stan,

The EWG list doesn't serve for development. As such, all patches should go to 
the developers list which is 

Or.


Re: [ewg] Re: Possible process deadlock in RMPP flow

2009-10-04 Thread Or Gerlitz

Eli Cohen wrote:
> Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
> different problem. Once I have information whether the patch Roland
> posted fixed it I will update the list.

Eli, did you find a commit that fixes the problem you reported on?

Or.




Re: [ewg] [PATCH] mlx4: remove limitation on LSO header size

2009-10-04 Thread Or Gerlitz

Eli Cohen wrote:

> Current code has a limitation as for the size of an LSO header not allowed to
> cross a 64 byte boundary. This patch removes this limitation by setting the WQE
> RR for large headers thus allowing LSO headers of any size. The extra buffer
> reserved for MLX4_IB_QP_LSO QPs has been doubled, from 64 to 128 bytes,
> assuming this is reasonable upper limit to header length.

Hi Eli,

Good to know that you're working on this. I assume you aim to close the
missing pieces here, e.g. as you wrote me at

http://lists.openfabrics.org/pipermail/general/2008-March/048370.html


> Also, this patch will cause IB_DEVICE_UD_TSO to be set only of FW versions that
> set MLX4_DEV_CAP_FLAG_BLH; e.g. FW version 2.6.000 and higher.

Should there be a warning to users having an older firmware installed?


> +++ b/drivers/infiniband/hw/mlx4/main.c
> @@ -103,7 +103,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
>  		props->device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE;
>  	if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM)
>  		props->device_cap_flags |= IB_DEVICE_UD_IP_CSUM;
> -	if (dev->dev->caps.max_gso_sz)
> +	if (dev->dev->caps.max_gso_sz && dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BLH)
>  		props->device_cap_flags |= IB_DEVICE_UD_TSO;

So the driver doesn't use the actual value of the max_gso_sz capability;
isn't this a bug? The BLH bit (any reason not to mention in the
change-log what these three letters stand for?) lets you support
large LSO headers, but it isn't enough: max_gso_sz is related to the
payload and should be used, I think.



>  	if (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_RESERVED_LKEY)
>  		props->device_cap_flags |= IB_DEVICE_LOCAL_DMA_LKEY;
> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
> index 219b103..1b356cf 100644
> --- a/drivers/infiniband/hw/mlx4/qp.c
> +++ b/drivers/infiniband/hw/mlx4/qp.c
> @@ -261,7 +261,7 @@ static int send_wqe_overhead(enum ib_qp_type type, u32 flags)
>  	case IB_QPT_UD:
>  		return sizeof (struct mlx4_wqe_ctrl_seg) +
>  			sizeof (struct mlx4_wqe_datagram_seg) +
> -			((flags & MLX4_IB_QP_LSO) ? 64 : 0);
> +			((flags & MLX4_IB_QP_LSO) ? 128 : 0);

64, 128 ... here and later in build_lso_seg: how about defining some
human-readable constant?


Or.
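
A minimal sketch of the two review comments above, in C. All names here
(SKETCH_LSO_HDR_BUF_SIZE, SKETCH_CAP_FLAG_BLH, sketch_lso_overhead,
sketch_can_advertise_tso) are hypothetical and are not part of the mlx4
driver, and the bit value is a placeholder rather than the real
MLX4_DEV_CAP_FLAG_BLH. It only illustrates a named constant in place of the
bare 64/128 and the combined capability test the patch introduces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names and values, chosen only for this sketch. */
#define SKETCH_LSO_HDR_BUF_SIZE	128		/* bytes reserved for an LSO header */
#define SKETCH_CAP_FLAG_BLH	(1ULL << 0)	/* placeholder for the BLH capability bit */

/* The new buffer must at least cover the old 64-byte limit. */
static_assert(SKETCH_LSO_HDR_BUF_SIZE >= 64, "LSO header buffer too small");

/* Extra send WQE bytes needed by a UD QP when LSO is enabled. */
static int sketch_lso_overhead(int lso_enabled)
{
	return lso_enabled ? SKETCH_LSO_HDR_BUF_SIZE : 0;
}

/* TSO is advertised only if a GSO size is reported and the BLH bit is set. */
static int sketch_can_advertise_tso(uint32_t max_gso_sz, uint64_t cap_flags)
{
	return max_gso_sz && (cap_flags & SKETCH_CAP_FLAG_BLH);
}
```

Note that, like the patch, this test only checks that max_gso_sz is non-zero;
it does not use its value, which is the concern raised in the review.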



Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping

2009-10-06 Thread Or Gerlitz

Sean Hefty wrote:

> Provide an option for user's to manually specify the socket address to DGID
> mapping on InfiniBand.  Currently, all mappings are done using ipoib, and
> involve ARP.  This will not work across IP subnets, and alternative mechanisms
> of resolving the mapping are being explored. The latter can be more efficient
> if combined with route resolution as well.

Sean,

If I understand correctly, your suggested change optionally lets an
application, instead of the following sequence of calls


rdma_resolve_addr  / addr resolved event
rdma_create_qp
rdma_resolve_route  / route resolved event
rdma_connect / cm events

do

rdma_set_ib_path
rdma_create_qp
rdma_connect / cm events

So in that respect, I am not sure how rdma_set_dest serves you. Further, 
rdma_resolve_addr does three resolutions


1. the local device and source gid
2. the PKEY (VLAN) to use
3. the destination gid

so in that respect, rdma_set_ib_path replaces both rdma_resolve_addr and
rdma_resolve_route?

I would prefer to have a solution where the app flow isn't touched, 
something like the kernel rdma-cm to communicate with the user space ACM 
daemon to get address and route resolutions.  Does such a design make
sense to you?



Or



Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping

2009-10-07 Thread Or Gerlitz
Sean Hefty wrote:

> From user space, the call sequence does not change.  The user calls
> rdma_resolve_addr, rdma_resolve_route, rdma_connect, etc.  It is up to the
> librdmacm to perform the resolution.  Today, the resolution request is simply
> passed down to the kernel, which restricts how the resolution can be 
> performed.

good, fair-enough

> I kept resolving the address and route separate.  rdma_set_ib_path, which has
> always existed btw, simply sets the route/path.   The new call,
> rdma_set_ib_dest, sets the address mapping.  To use rdma_set_ib_dest, the user
> must have called rdma_bind_addr first, which covers steps 1 & 2 that you
> mentioned above.  The rdma_bind_addr call can be done internally to the
> librdmacm as part of the rdma_resolve_addr implementation.

I understand that rdma_bind_addr covers the local device and vlan
resolutions, but I think we should also --keep-- supporting
applications that use an explicit source address in rdma_resolve_addr
or that don't do bind, provide src=NULL to resolve_addr and rely on
the rdma-cm to use route lookup (as the rdma_resolve_addr man page
indicates) for the device/vlan resolution.

> If a user sets the wrong address mapping or route, they should only affect 
> themselves

I'm not sure I follow this comment; can you elaborate a bit more?

> (FYI - I have not yet implemented the librdmacm to call rdma_bind_addr as part
> of rdma_resolve_route on linux.  I did not see an easy way to convert a
> destination IP address to a source IP address.  If anyone knows how, please
> let me know.)

I assume you were referring to rdma_resolve_addr, correct? There should be
a way to do that from user space, and if not, you can go down to the
kernel, resolve the device/vlan and then call ACM to resolve the
destination. It seems that you must resolve the dev/vlan in order to issue
the ACM ARP replacement...

> >I would prefer to have a solution where the app flow isn't touched,
> >something like the kernel rdma-cm to communicate with the user space ACM
> >daemon to get address and route resolutions.  Does such a design make
> >sense to you?

> Long term, this is exactly the type of flow that I envision.  I'd like to have
> real data to show that the ACM implementation scales first, which is part of
> my problem.  I do not have the ability to easily change kernel drivers on any
> larger sized clusters.  My approach is to allow user space to perform the
> address and route resolution and pass the data to the kernel.  This way, we
> have the freedom to test multiple solutions, until we can settle on what works.

I am not sure I fully follow the easily-change-kernel-drivers
claim; isn't some change to the kernel rdma-cm a must for the
ACM + librdmacm solution to work? Suppose you have a way to fully do
the addr+route resolutions from user space; will the kernel rdma-cm
state machine be willing to issue

rdma_create_id
rdma_set_ib_path (you said this exists today?)
rdma_create_qp
rdma_connect

???

Or.


Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping

2009-10-08 Thread Or Gerlitz
On Thu, Oct 8, 2009 at 1:42 AM, Sean Hefty  wrote:
> My intent, which differs from Jason's, was to fully support the existing
> librdmacm interfaces as they are defined.

yes, I agree this is the way to go

> Implementation wise, if the user of the librdmacm calls rdma_resolve_addr
> with a src address, it's easy.  Without the src address, it's hard, but I may
> just be missing some easy interface for finding the src address.

note that dst can map to multiple src addresses, so you're just
looking for one of them... it's doable, I will get you the details if
you still need them


>>> If a user sets the wrong address mapping or route, they should only affect
>>> themselves

>> I'm not sure I follow this comment; can you elaborate a bit more?

> I meant that if some bogus app wants to specify an IP to GID mapping that's
> invalid, the incorrect mapping should only affect connections for that app.

yes, this makes sense and I believe the rdma-cm code is written such
that one bogus ID doesn't leak its defects to other IDs

> I can somewhat implement an ACM + librdmacm solution entirely in user space by
> layering the librdmacm over libibcm.  Because of the event reporting, it would
> be limited in how it could be used, and is unlikely to be something that would
> ever be supported.

yes, it would be limited and not really supportable, going that way
for research / experimentation and development is fine, just make sure
to never release that...

> Technically, rdma_resolve_addr could remain unchanged, in which case it will
> do everything it does today, which may include sending an ARP.  This is the
> specific operation that I'd like to avoid.

again, apps (both user and kernel ones) do use rdma_resolve_addr and
we want them to keep doing so (I thought we agreed on that). For
staging you may develop the type II address resolution prototype on
top of libibcm but later rdma_resolve_addr would call IBACM and then
sync with the kernel.

Basically, can we agree that a type II rdma_resolve_addr(src, dst, timeout)
would look like

if (src)
        rdma_bind(src)
else
        call_some_user_space_networking_api_to_convert_dst_to_netdev/src

next, now that we have dev/pkey:
- call ACM to resolve dst IP to GID and use dev/pkey for that
- sync the kernel rdma_cm with the resolution if needed for the
  state-machine (hopefully it's not a must at this point and can be done
  when calling set_path).

Or
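
The flow outlined above can be sketched in C roughly as follows. This is only
an illustration of the ordering of the steps: every function here
(bind_src, route_lookup_src, acm_resolve_dgid, sync_kernel,
resolve_addr_type2) is a hypothetical stub that records its name in a trace
string, and none of them is a real librdmacm or ACM call.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical context: only records the order in which steps ran. */
struct resolve_ctx {
	char trace[128];
};

/* Append a step name to the comma-separated trace. */
static void step(struct resolve_ctx *ctx, const char *name)
{
	size_t used = strlen(ctx->trace);

	snprintf(ctx->trace + used, sizeof(ctx->trace) - used,
		 "%s%s", used ? "," : "", name);
}

/* Stub: bind to an explicit source address (yields dev/pkey). */
static int bind_src(struct resolve_ctx *ctx, const char *src)
{
	(void)src;
	step(ctx, "bind");
	return 0;
}

/* Stub: user space route lookup turning dst into a netdev/src. */
static int route_lookup_src(struct resolve_ctx *ctx, const char *dst)
{
	(void)dst;
	step(ctx, "route_lookup");
	return 0;
}

/* Stub: ask ACM to resolve the destination IP to a GID. */
static int acm_resolve_dgid(struct resolve_ctx *ctx, const char *dst)
{
	(void)dst;
	step(ctx, "acm");
	return 0;
}

/* Stub: hand the resolution to the kernel rdma_cm if it needs it. */
static int sync_kernel(struct resolve_ctx *ctx)
{
	step(ctx, "sync");
	return 0;
}

/* Type II address resolution: local side first, then ACM, then sync. */
static int resolve_addr_type2(struct resolve_ctx *ctx,
			      const char *src, const char *dst)
{
	int ret;

	assert(dst != NULL);
	ret = src ? bind_src(ctx, src) : route_lookup_src(ctx, dst);
	if (ret)
		return ret;
	ret = acm_resolve_dgid(ctx, dst);
	if (ret)
		return ret;
	return sync_kernel(ctx);
}
```

The point of the sketch is that the application-visible entry point stays a
single call, and the src=NULL case differs only in how the local dev/pkey is
obtained.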


Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping

2009-10-08 Thread Or Gerlitz
Sean Hefty  wrote:
>>When used over IB, the IP address is little more than a qualifier contained
>>within the IB CM REQ private data.
>
> If we added support for AF_GID/AF_IB to the kernel, the rdma_cm could leave
> all of the private data carried in the IB CM REQ entirely up to the user.  If
> the user happens to format that data to look like the CMA header, so be it.  I
> believe this would allow for a 'clean' implementation of rdma_resolve_addr,
> preserve the ABI, and still allow a library to provide backwards
> compatibility.

Sean,

So in this design librdmacm will change the user supplied AF_XXX in
the provided sock address and set it to AF_GID/IB, sounds okay.

> Would this approach combined with the ability to set the route work for 
> everyone?

yes, it makes sense.

However, I can't quite follow your port space discussion with
Jason. Some apps may have the client in user space and the server in the
kernel or vice versa. I wouldn't tie PS_IB or the like to ACM. The ACM
ARP replacement protocol will reply only if the ip address specified
in the broadcast request is an ip of this host on that pkey and a port
connected to that fabric, correct?

Or.


Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping

2009-10-08 Thread Or Gerlitz
Jason Gunthorpe  wrote:

> If the listening side continues to use the IP mode to listen then I guess the
> client can compute an appropriate service ID, but it seems a bit
> strange for one side to use IP and the other side to use the ACM
> method? I was imagining you'd configure both sides to use the same  method.

First, unlike IP, in IB only the active/connecting side does address
resolution. Second, the listener may be in the kernel while the active
side may be in user space. Anyway, ACM is an alternative way to do
destination gid resolution and path query emulation; I don't see what
it has to do with the CM protocol, except for keeping things the way
they were in this respect (rdma-cm IP header in the REQ, etc).

I don't see why if someone is resolving address through ACM they
aren't PS_TCP consumers.

Or


Re: [ewg] rping is not resolving ipv6 addresses

2009-10-09 Thread Or Gerlitz
David J. Wilder  wrote:
> I added an option to rping to specify a source address and supply it to

patch?

> rdma_resolve_addr(), but now it is failing rdma_resolve_route().
> $ ./rping -d  -c -v -a fe80::202:c903:1:1925 -i fe80::202:c903:1:28ed
> cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x100213d0 (parent)
> cma_event type RDMA_CM_EVENT_ROUTE_ERROR cma_id 0x100213d0 (parent)
> cma event RDMA_CM_EVENT_ROUTE_ERROR, error -22

What does the neighbour info (ip neigh show | grep 1925) show after
running rping?
Can you do an ipoib ping and ping6 to the fe80::202:c903:1:1925 host?

Or.


Re: [ewg] rping is not resolving ipv6 addresses

2009-10-09 Thread Or Gerlitz
David J. Wilder  wrote:

> If I run rping without my rping change to add the source address to
> rdma_resolve_address(),  ip neigh show gives:
>   fe80::202:c903:1:1925 dev eth1  FAILED
> Notice that interface is incorrect, it should be ib0. tcpdump showed the
> neighbor-discovery sent out the eth0 interface.

yes, this is as Roland explained.


> Running with my rping change to specify the local-link address of my ib0
> interface "ip neigh show" never shows any entry for  fe80::202:c903:1:1925

mmm... weird, run your rping with tcpdump in another screen and see
if ND takes place

> ping6 will work but I must specify the interface to use:
> ping6 fe80::202:c903:1:1925%ib0

after the tcpdump experiment, run ping6 and immediately following that,
or in parallel in another window, run rping


Re: rping is not resolving ipv6 addresses

2009-10-11 Thread Or Gerlitz
Sean Hefty wrote:
> The rdma cm was never fully coded or tested for ipv6 support.

Sean, even if not fully coded/tested, some work has been done, e.g. commits
38617c64 "RDMA/addr: Add support for translating IPv6 addresses" and 1f5175ad
"RDMA/cma: Add IPv6 support". I suggest we try to see what it takes to
make this better or even fully support ipv6.

Jason, can you restate the two problems you saw in David's reports?
The first was related to scope in link-local addresses, and what's the second?

Or.


Re: OpenSM Failover

2009-10-12 Thread Or Gerlitz

Yevgeny Kliteynik wrote:
> There was a hand-over problem in OFED 1.4, but later it turned  out to
> be FW issue. The thing is, FW version 2.6.648 doesn't  have this bug
> any more...

So things should work fine with the newly released 2.7 firmware? If this
is still under question, Aaron, I suggest you open a bugzilla case at
https://bugs.openfabrics.org and we can track it from there.


Or.



Re: [PATCH 1/2] rdma/cm: support option to allow manually setting IB path

2009-10-13 Thread Or Gerlitz

Sean Hefty wrote:

> Before spending any more time on this patch series, is there any disagreement to
> accepting this patch (as is or slightly modified) upstream?

Hi Sean,

This patch just sets a route to the kernel and has the kernel issue a
route resolved event in return, sounds good to me, I don't see any
problem with merging it upstream.

However, we still have a discussion to continue on the slightly bigger
picture, which is related to how address resolution is "set" to the
kernel, what port spaces would be supported, etc. This discussion
somehow gets closer to the ACM design... let's continue with that on the
"rdma/cm: allow user to specify IP to DGID mapping" thread.


Or.



Re: switching the active interface for bonding

2009-10-14 Thread Or Gerlitz

Sumeet Lahorani wrote:
> We are [...] trying to simulate the effect of a bonding failover
> initiated by a switch failure using echo commands in parallel to the
> /sys/class/net/bond0/bonding/active_slave file on a few of the nodes
> attached to the switch. Is this an acceptable technique?

yes

> We are trying to avoid actually resetting the switch to avoid
> affecting other nodes connected to the same switch, since the other
> nodes are being used for other purposes

There's no need to reboot a switch in order to cause an IB link down
event on an HCA port across the wire connected to one of the switch
ports. You can administratively disable the switch port you want and
later administratively enable it. This is as simple as

   $ ibportstate $LID $PORT disable|query|enable

using the LID of the switch and the switch port the HCA port is connected to.

> Would there be any difference in terms of the code path which the
> bonding driver/ofed stack follows when we do this as opposed to
> resetting the switch?

yes and yes.

Bonding wise, when setting the active slave through sysfs, the bonding
driver doesn't go through the link monitoring code, whereas if you do
cause a link down it does.


As for the IB stack (there's nothing like "ofed stack", ofed is just a 
bunch of rpms installed over your distro), when a port goes down, things 
happen...  if the software you're using counts/uses IB port down events, 
you may exercise a different flow, e.g IPoIB is using these events, and 
you will not go through the port down flow of it. Next, if some code 
you're working with uses the IB RC transport, then depending on the 
timeout programmed to the RC QP, a transport timeout may happen which in 
turn causes the HW to move the QP into the error state, and so on.


Or
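
For reference, the sysfs technique discussed above (the equivalent of an
echo into the active_slave file) could be sketched in C as below. The helper
name set_active_slave is hypothetical, and the slave name used in the usage
note is an assumption; only the sysfs path comes from the thread.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a slave name to a bonding "active_slave" attribute file.
 * Returns 0 on success, -1 on failure.
 */
static int set_active_slave(const char *attr_path, const char *slave)
{
	FILE *f;
	int ok;

	assert(attr_path && slave);
	f = fopen(attr_path, "w");
	if (!f)
		return -1;
	ok = fputs(slave, f) >= 0;
	/*
	 * stdio buffers the data, so the actual write happens on close;
	 * a value rejected by the kernel shows up as an fclose() error.
	 */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A call such as set_active_slave("/sys/class/net/bond0/bonding/active_slave",
"ib1") would request the switch, assuming ib1 is one of the bond's slaves. As
noted above, this path does not exercise the bonding link monitoring code or
the IB port down flow.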


Re: switching the active interface for bonding

2009-10-14 Thread Or Gerlitz

Sumeet Lahorani wrote:

> We are using OFED 1.4.2

Please note that the bonding driver provided by the latest distros
supports IPoIB. So if your distro happens to be RHEL 5.4 (or its OEL 5.4
derivative) or SLES11, you can and should use the distro-provided
bonding. Moving forward, on the one hand customers would use only distro
code, and on the other hand bonding will be moved out of ofed, so you had
better start working with it now, as things will be outside ofed anyway.


Or.



Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping

2009-10-20 Thread Or Gerlitz

Sean Hefty wrote:

> From the perspective of IB, the RDMA CM simply defines a specific format to
> private data and service ID carried in the IB CM REQ.  As long as any use
> adheres to that protocol, interoperability won't be an issue.

okay, I just wanted to make sure that the whole thing (ACM + modified
librdmacm + modified rdma-cm) is applicable AND inter-operable for
AF_INET / PS_TCP applications.
Looking at the kernel cma.c format_hdr code, it first branches on the
address family and next on the port space.


Going with your proposed flow, I understand that an app call to 
rdma_resolve_addr will be broken down to rdma_bind_addr, ACM resolution 
of the destination GID and then rdma_set_ib_dest, so things should work 
perfect for AF_INET / PS_TCP apps, correct?


The only missing piece here is the route lookup from user space for 
applications not specifying a source address in their rdma_resolve_addr 
invocation, do you still need help to implement that?



> Essentially, the RDMA CM interface would become capable of connecting to any IB
> application.  (I really haven't thought through the details yet, and the
> addition of RDMA_PS_IB shouldn't be part of the initial patch submission.)

fair enough, I just wanted to make sure with you that AF_IB / PS_IB
aren't tightly coupled with the proposed change, and you have clarified that.



> The ACM responds based on a configuration file.  The ib_acme utility can create
> that file using the active IP, pkey, port information of the system, but the
> current ACM implementation does not adjust to dynamic changes or detect
> misconfigurations or other made up words.

I see. Is the new librdmacm flow going to be under a new API, e.g.
rdma_resolve_addr/route_ext, or the same API optionally talking to ACM
through some IPC if the ACM daemon is running, or something else?


Or.




Re: [PATCH 1/2] rdma/cm: support option to allow manually setting IB path

2009-10-20 Thread Or Gerlitz

Or Gerlitz wrote:

> sounds good to me, I don't see any problem with merging it upstream.

Hi Sean,

Are you moving forward with these patches to 2.6.33 ?

Or


[PATCH] ib/iser: re-write SG handling for rdma logic

2009-10-20 Thread Or Gerlitz
After dma-mapping an SG list provided by the SCSI midlayer, iser has
to make sure the mapped SG is "aligned for RDMA" in the sense that it's
possible to produce one mapping in the HCA IOMMU which represents the
whole SG. Next, the mapped SG is formatted for registration with the HCA.

This patch re-writes the logic that does the above, to make it clearer
and simpler. It also fixes a bug in the aligned-for-RDMA checks,
where only an "end" check was done and the "start" check was missing.

Signed-off-by: Alexander Nezhinsky 
Signed-off-by: Or Gerlitz 

Index: linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
===
--- linux-2.6.32-rc5.orig/drivers/infiniband/ulp/iser/iser_memory.c
+++ linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
@@ -209,6 +209,8 @@ void iser_finalize_rdma_unaligned_sg(str
 	mem_copy->copy_buf = NULL;
 }

+#define IS_4K_ALIGNED(addr)	((((unsigned long)addr) & ~MASK_4K) == 0)
+
 /**
  * iser_sg_to_page_vec - Translates scatterlist entries to physical addresses
  * and returns the length of resulting physical address array (may be less than
@@ -221,62 +223,52 @@ void iser_finalize_rdma_unaligned_sg(str
  * where --few fragments of the same page-- are present in the SG as
  * consecutive elements. Also, it handles one entry SG.
  */
+
 static int iser_sg_to_page_vec(struct iser_data_buf *data,
 			       struct iser_page_vec *page_vec,
 			       struct ib_device *ibdev)
 {
-	struct scatterlist *sgl = (struct scatterlist *)data->buf;
-	struct scatterlist *sg;
-	u64 first_addr, last_addr, page;
-	int end_aligned;
-	unsigned int cur_page = 0;
+	struct scatterlist *sg, *sgl = (struct scatterlist *)data->buf;
+	u64 start_addr, end_addr, page, chunk_start = 0;
 	unsigned long total_sz = 0;
-	int i;
+	unsigned int dma_len;
+	int i, new_chunk, cur_page, last_ent = data->dma_nents - 1;

 	/* compute the offset of first element */
 	page_vec->offset = (u64) sgl[0].offset & ~MASK_4K;

+	new_chunk = 1;
+	cur_page  = 0;
 	for_each_sg(sgl, sg, data->dma_nents, i) {
-		unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
-
+		start_addr = ib_sg_dma_address(ibdev, sg);
+		if (new_chunk)
+			chunk_start = start_addr;
+		dma_len = ib_sg_dma_len(ibdev, sg);
+		end_addr = start_addr + dma_len;
 		total_sz += dma_len;

-		first_addr = ib_sg_dma_address(ibdev, sg);
-		last_addr  = first_addr + dma_len;
-
-		end_aligned   = !(last_addr  & ~MASK_4K);
-
-		/* continue to collect page fragments till aligned or SG ends */
-		while (!end_aligned && (i + 1 < data->dma_nents)) {
-			sg = sg_next(sg);
-			i++;
-			dma_len = ib_sg_dma_len(ibdev, sg);
-			total_sz += dma_len;
-			last_addr = ib_sg_dma_address(ibdev, sg) + dma_len;
-			end_aligned = !(last_addr  & ~MASK_4K);
+		/* collect page fragments until aligned or end of SG list */
+		if (!IS_4K_ALIGNED(end_addr) && i < last_ent) {
+			new_chunk = 0;
+			continue;
 		}
+		new_chunk = 1;

-		/* handle the 1st page in the 1st DMA element */
-		if (cur_page == 0) {
-			page = first_addr & MASK_4K;
-			page_vec->pages[cur_page] = page;
-			cur_page++;
+		/* address of the first page in the contiguous chunk;
+		   masking relevant for the very first SG entry,
+		   which might be unaligned */
+		page = chunk_start & MASK_4K;
+		do {
+			page_vec->pages[cur_page++] = page;
 			page += SIZE_4K;
-		} else
-			page = first_addr;
-
-		for (; page < last_addr; page += SIZE_4K) {
-			page_vec->pages[cur_page] = page;
-			cur_page++;
-		}
-
+		} while (page < end_addr);
 	}
+
 	page_vec->data_size = total_sz;
 	iser_dbg("page_vec->data_size:%d cur_page %d\n", page_vec->data_size,cur_page);
 	return cur_page;
 }

-#define IS_4K_ALIGNED(addr)	((((unsigned long)addr) & ~MASK_4K) == 0)

 /**
  * iser_data_buf_aligned_len - Tries to determine the maximal correctly aligned
@@ -284,42 +276,40 @@ static int iser_sg_to_page_vec(struct is
  * the number of entries which are aligned correctly. Supports the case where
  * consecutive 
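
The chunk-collecting loop in the rewritten iser_sg_to_page_vec can be tried
outside the kernel with a small stand-alone sketch. This is not the driver
code: struct sg_ent and sg_to_page_vec are hypothetical stand-ins for the
DMA-mapped scatterlist and the real function, and the sketch assumes the list
is already aligned for RDMA and that the output array is large enough.

```c
#include <assert.h>
#include <stdint.h>

#define SIZE_4K 4096ULL
#define MASK_4K (~(SIZE_4K - 1))
#define IS_4K_ALIGNED(addr) ((((uint64_t)(addr)) & ~MASK_4K) == 0)

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct sg_ent {
	uint64_t addr;
	uint32_t len;
};

/*
 * Fill pages[] with the 4K page addresses covering the SG entries and
 * return how many were written.
 */
static int sg_to_page_vec(const struct sg_ent *sg, int nents, uint64_t *pages)
{
	uint64_t start, end, page, chunk_start = 0;
	int i, new_chunk = 1, cur_page = 0;

	assert(nents > 0);
	for (i = 0; i < nents; i++) {
		start = sg[i].addr;
		if (new_chunk)
			chunk_start = start;
		end = start + sg[i].len;

		/* keep collecting fragments until the end is aligned or the list ends */
		if (!IS_4K_ALIGNED(end) && i < nents - 1) {
			new_chunk = 0;
			continue;
		}
		new_chunk = 1;

		/* masking matters only for a possibly unaligned first entry */
		page = chunk_start & MASK_4K;
		do {
			pages[cur_page++] = page;
			page += SIZE_4K;
		} while (page < end);
	}
	return cur_page;
}
```

For example, two fragments of the same page, {0x1000, 0x800} followed by
{0x1800, 0x800}, are merged into a single chunk and produce one page entry.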

Re: [PATCH] librdmacm: initialize correct pthread condition in rdma_join_multicast

2009-10-22 Thread Or Gerlitz

Sean Hefty wrote:

> rdma_join_multicast re-initializes id_priv->cond rather than mc->cond. Fix
> this.  Bug reported by Nir Naaman

Any idea what the impact of this bug is?

Or.



Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping

2009-10-22 Thread Or Gerlitz

Sean Hefty wrote:

>> okay, I just wanted to make sure that the whole thing (ACM + modified librdmacm
>> + modified rdma-cm) is applicable AND inter-operable for AF_INET / PS_TCP
>> applications

> I do not intend to have any changes that break anything

My question went beyond whether things are going to be broken (they
aren't, as you said), but rather whether ACM is going to be
***applicable*** for AF_INET/PS_TCP applications. From your reply and
the discussion that followed between you and Jason, I got the impression
that the answer is "not really", because if, for example, the server side
thinks it would be getting the IP address of the connecting side in the
REQ private header, once this REQ was sent in the flow of AF_INET which
was converted to AF_IB, this is not going to happen. Moreover, if the
SID constructed by an AF_INET / PS_TCP call to rdma_resolve_addr which
uses the librdmacm-ACM flow doesn't match the SID constructed on the
passive side which didn't use this flow (e.g. user --> kernel or kernel
--> user app), the REQ wouldn't get anywhere and would be rejected by
the CM on the passive side :(



>> Going with your proposed flow, I understand that an app call to
>> rdma_resolve_addr will be broken down to rdma_bind_addr, ACM resolution of the
>> destination GID and then rdma_set_ib_dest, so things should work perfect for
>> AF_INET / PS_TCP apps, correct?

> This is my current plan for the kernel: export rdma_set_ib_paths to user space.
> Submit a patch.  Get it accepted upstream.  Eat ice cream to celebrate.

Again, rdma_set_ib_path by itself is quite innocent... as I wrote you a
couple of days ago, it can be merged anytime. The big thing is the modified
bind / address resolution flow, which affects connect/listen,
etc.  So just for this patch, I would go for a small ice cream, and
once the design for the bigger picture is in place, go for a pint...



> Define AF_IB and struct sockaddr_ib (contains a gid and service id).  Update
> rdma_bind_addr, rdma_resolve_addr, and rdma_connect to handle AF_IB.
> rdma_bind_addr fills in the sid according to the RDMA IP CM service annex.
> rdma_resolve_addr just needs to save the GIDs. rdma_connect will not modify the
> private data in the CM REQ for AF_IB.

I really tried to follow the thread between you and Jason, with quite
little success, and I am going to give it more tries... in parallel,
could you help me understand what the --drive/reasoning-- is, from your
perspective, for adding AF_IB / PS_IB here?


I believe that the suggestion I brought up would allow serving
AF_INET / PS_TCP with ACM: convert rdma_resolve_addr
with a null src addr to a route lookup followed by rdma_bind_addr
(with a similar/same flow for rdma_resolve_addr with a src address), next do
the ACM dgid resolution, and then call rdma_set_dgid.


If for other reasons people want the rdma-cm to support AF_IB and/or
PS_IB, we can do that as well, but why force doing that under the covers
each time ACM is used?!


Or.



Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping

2009-10-25 Thread Or Gerlitz

Sean Hefty wrote:

> These are the things done today in the kernel wrt IB:
> * Map a local or remote IP address to a GID
> * If a local address is not given, provide a usable address based on the
>   destination address
> * Acquire a path between the source and destination
> * Format the first 36 bytes of private data in the CM REQ
> Any or all of these could be done in user space instead.  Adding AF_IB to the
> kernel can provide a clean way of enabling this.  It can also allow full
> support of IB CM functionality through the RDMA CM interfaces

Sean,

First, on top of what you have mentioned above, the kernel also
generates the SID to connect to / listen on, maintains a "binding"
(mapping) between an rdma-cm id and a netdevice, which today is used for
generating address change events, and maybe performs some more tasks which
neither of us brought up. From what you write here I understand that the
reasoning is something like:


1. we can do all this in user space
2. for that end AF_INET/PS_TCP flow has to be converted to AF_IB/PS_IB 
behind the cover


Well, you didn't address some of my comments (not the ice-cream
ones...), which say that this wouldn't be inter-operable if on
one side you convert INET/TCP to IB/IB and on the other side you don't
(e.g. userA/userB, user/kernel, kernel/user etc. schemes). Also, the
functionality added under the bonding scheme is lost, etc.


I am asking you to have INET/TCP apps enjoy both ACM's DGID and route 
resolution without being converted to IB/IB, simple as that. If needed 
I'd be happy to assist in making this flow happen.


The rdma-cm was born first and foremost to serve as glue between the IP
and RDMA worlds, and I just ask you, as the maintainer, to keep this
well-working glue in place under ACM as well.


Or.


Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping

2009-10-25 Thread Or Gerlitz

Jason Gunthorpe wrote:

> So why not have a more general, flexible approach? Isolating ACM from librdmacm
> by using AF_IB is a good idea, it keeps them separate and lets ACM and future
> go where ever. I hope Sean can make it work with the rdma_getaddrinfo idea, that
> would completely separate ACM and librdmacm

Generally speaking, AF_IB/PS_IB sounds okay to me, even though I am not
clear on what applications are going to use it; maybe some examples please?

> Attempting to bake it into AF_INET means that librdmacm, possibly the kernel
> and maybe even the apps need to be contaminated with ACM specific code, and
> that just doesn't seem desirable to me. What happens when someone invents BCM
> or CCM? More mess

I don't agree; the only place where librdmacm goes to ACM is to resolve
a DGID and a route. This can be done with rdma_getdgidinfo &
rdma_getrouteinfo if you like (maybe you do the implementation?), or
with an ACM (later BCM, CCM) plugin used by librdmacm, or by calls from
librdmacm to ACM. But in any case, neither the kernel code nor the app will
be contaminated.


Or.


Re: [PATCH] link-local address fix for rdma_resolve_addr

2009-10-25 Thread Or Gerlitz

Jason Gunthorpe wrote:

> Mainly for RDMA what you get is more kernel flexibility, IP like
> capabilities, better bonding, and better support of IB APM semantics:
>  - bind() + listen() actually works properly if more than one
>    interface is bound to the same IP - the cm_id returned by accept is
>    bound to the hca and port that accepted the connection
>    [ This is a L3 form of bonding Linux supports ]
>
>    This is actually something of a mandatory notion to implement the
>    full generality of the IB CM protocol which allows the CM REP to
>    contain a port GUID of another port on the same node (multi-port
>    APM is an IB feature). So you never know what port the accept()
>    result will get bound to.
>
>    BTW: I suppose ideally AF_IB would have a way to say 'CM accept REPs on
>    any port on this node' Hmm, reserved GID prefix perhaps? Hmm.
>
>    When used with bonding this would also afford the kernel with the
>    ability to accept incoming connections across all the redundant
>    RDMA devices - and still have correct bound-to-IP semantics.
>
>  - rdma_resolve_addr more or less as the inverse of all the above
>    * multiple interfaces with same IP case works, kernel and routing
>      table can distribute outgoing connections
>    * multi-port APM works, kernel and user space can choose primary
>      and backup port for the IP addy
>    * bonding works, kernel can balance outgoing connections across the
>      bond slaves.
>
> These are all useful features.


Jason, have you even looked into or tested any of the bonding
load-balancing modes with IPoIB? Some/most of them are not applicable to
IPoIB, and I don't think the ones that might be were ever
tested. Next, multiple interfaces with the same IP address isn't
something I see as very useful for production environments (but I'd be happy
to get educated on what L3 bonding is and where it can play). Finally,
multi-port APM isn't something I have ever heard customers require,
and more importantly, from comments made by Sean in the past, I don't
think it fits the rdma-cm spirit.


All in all, someone comes here and suggests some fixes to the rdma-cm
address resolution code to make IPv6 work. I don't think Dave should
carry all your proposed future enhancements on his back/patch.  Let him
fix things, and following that you can work on the patches to support all
these nice/niche features, starting with IPv4 and then IPv6.


Or



Re: [PATCH] link-local address fix for rdma_resolve_addr

2009-10-27 Thread Or Gerlitz

Jason Gunthorpe wrote:

> I was saying that point in the rdmacm where the rdma_cm_id is bound to a local
> RDMA device should have only been rdma_resolve_addr and rdma_accept.
> Overloading rdma_bind_addr to both bind to an IP and bind to an RDMA device was
> a bad API choice.

As you wrote, in most cases binding comes into play only for users
calling rdma_resolve_addr or rdma_accept. For users that need explicit
binding, the rdma-cm provides rdma_bind, and it binds to both the IP and the
device. If you can do better, send a patch; binding can't be removed
from the API since it has users, and it makes sense for users to
require it.

> Sean is right, there may be special cases that require an early binding, but a
> separate API - like IP's SO_BINDTODEVICE - would have been better - and users
> are forewarned that calling it restricts the environments their app will support

It's just naming; will $ sed s/rdma_bind/rdma_set_opt(RDMA_BINDTODEVICE)/g
make you happier? Why?

> As it stands we have several impossible situations. Sean, Dave, and I were
> discussing the trade-offs of what this means relative to IP route resolution

Don't tell me that Dave's patches are blocked b/c you discovered the
rdma_bind design and now you don't like it. As I wrote you, Dave sent a
patch to fix the IPv6 support; during the discussion on his patches you
come and bring up more and more issues you consider problems (but I
don't) and block the patch set. I don't think this is appropriate. Let
the patches go in and send your own patches to fix the problems you see. Why
does anyone touching some piece of code have to fix the problems you see in that piece?!




> - but it affects bonding too. If you rdma_bind_addr to the IP of a bonding
> device, the stack must pick one of the local RDMA ports immediately. If you
> then call rdma_listen there is a problem: incoming connections may target
> either RDMA device, but you are only bound to one of them. An app cannot say 'I
> want to listen on this IP, any RDMA device' with the current API, as you can in
> IP, and that is a shame

An app can say: I want to listen on that IP and the RDMA device which is
associated with this IP now. When bonding does fail-over it generates a
netevent; the rdma-cm catches this event and generates an address change
event, and apps can redo their bind/listen at this point. So far
we have never gotten a user report of a problem; people are probably listening
on all IPs, which works perfectly with bonding. Currently the HA
mode of bonding will respond to ARP on only one of the devices, and as
such connection requests will not target just any RDMA device but rather only
the active one. If this is such a shame, send a fix; spraying mud on the
maintainer and/or someone sending another patch is a shame, isn't it?




> Traditionally with ethernet the L2 bonding is really only used for link
> aggregation, L1 failure, and a simple multi-switch HA scheme. It is not
> deployed if you have multiple ethernet domains. Some people prefer to have
> dual, independent ethernet fabrics, and in that case you rely on routing
> features to get the multipath, and HA features of bonding.

Okay, thanks for the crash course.


> Go back on the list and look up the posts from Leo who first discovered this,
> what he was trying to do is kinda the L3 bonding approach.

If Leo has a problem and you want to help him, bring it to the list,
debate, send patches; jumping into someone else's patch isn't the
constructive way to go.



> David has been doing a good job and I am glad he is working on the IPv6
> support. My comments are only intended to clarify how this is all supposed to
> work and why the IP flow is actually still relevant to RDMA connections.

As I see it, your comments block the patches sent by Dave. Sean?

Or.


Re: [PATCH] link-local address fix for rdma_resolve_addr

2009-10-28 Thread Or Gerlitz

Jason Gunthorpe wrote:

> Wow, seriously? You do understand the purpose of review, right?

I think I do, maybe not to the depth that you and your arguments do, but
again, repeating myself: my rather simple argument is that your review
goes way beyond the --change-- suggested by the patch and instead covers the whole
logic, and you block a patch b/c you don't like the logic this patch
integrates with. To some extent such practice is accepted, but you took
it way beyond the acceptable limit. I don't accept your assertion that
the whole logic is broken, and it makes sense to me to have a patch from
Dave to fix the IPv6 part of it. Next, or in parallel, you are welcome to
send a patch fixing/re-writing the whole bind logic or even the whole
rdma stack or the whole kernel.



> And yes, actually, accounting for how rdma_bind() is different from bind() when
> doing route resolution is pretty much the main remaining problem

Go and fix that.

Or.


Re: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM

2009-10-28 Thread Or Gerlitz
Jason Gunthorpe wrote:
> **COMPILE TESTED ONLY**

any reason why other people have to test for you?

> Convert the address resolution process for outgoing connections
> to be very similar to the way the TCP stack does the same operations.
> This fixes many corner case bugs that can crop up.

rdma_join_multicast(3) states that "before joining a multicast group, the
rdma_cm_id must be bound to an RDMA device by calling rdma_bind_addr or
rdma_resolve_addr". Please make sure that this flow isn't broken by your patch.

Or.


Re: [PATCH] opensm: Add initial support for optimized SLtoVLMappingTable programming

2009-10-31 Thread Or Gerlitz

Hal Rosenstock wrote:

> On Thu, Oct 29, 2009 at 10:23 PM, Sasha Khapyorsky wrote:
>
>> Implementation description would be very useful. What does "initial support"
>> mean?
>
> It means there's more to come in terms of using
> OptimizedSLtoVLMappingProgramming. This is the simplest use/introduction of
> this optional feature.

You can't just send people to read specs; your change log should explain
what the patch is about. If this is a big change to opensm, maybe even
RFC it with a detailed writeup.


Or.


Re: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM

2009-10-31 Thread Or Gerlitz

Jason Gunthorpe wrote:

> On Wed, Oct 28, 2009 at 10:05:19AM -0700, Sean Hefty wrote:
>
>> A UD endpoint can communicate using multicast and to other UD endpoints.  A
>> user could resolve a UD endpoint before joining a multicast group.
>
> So the IP world analog would be:
> fd = socket(AF_INET,SOCK_DGRAM);
> connect(fd,'Some Unicast Address');
> setsockopt(fd,IP_MULITCAST_ADD_MEMBERSHIP,'Some Multicast Address');
> sendto(fd,...,'Some Multicast Address');

IP multicast senders don't call IP_ADD_MEMBERSHIP, only receivers do.

Or.



[PATCH RESEND] ib/iser: re-write SG handling for rdma logic

2009-10-31 Thread Or Gerlitz
After dma-mapping an SG list provided by the SCSI midlayer, iser has
to make sure the mapped SG is "aligned for RDMA" in the sense that it's
possible to produce one mapping in the HCA IOMMU which represents the
whole SG. Next, the mapped SG is formatted for registration with the HCA.

This patch re-writes the logic that does the above, to make it clearer
and simpler. It also fixes a bug in the aligned-for-RDMA checks,
where only an "end" check was done and the "start" check was missing.

Signed-off-by: Alexander Nezhinsky 
Signed-off-by: Or Gerlitz 

Index: linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
===
--- linux-2.6.32-rc5.orig/drivers/infiniband/ulp/iser/iser_memory.c
+++ linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
@@ -209,6 +209,8 @@ void iser_finalize_rdma_unaligned_sg(str
mem_copy->copy_buf = NULL;
 }

+#define IS_4K_ALIGNED(addr) ((((unsigned long)addr) & ~MASK_4K) == 0)
+
 /**
  * iser_sg_to_page_vec - Translates scatterlist entries to physical addresses
  * and returns the length of resulting physical address array (may be less than
@@ -221,62 +223,52 @@ void iser_finalize_rdma_unaligned_sg(str
  * where --few fragments of the same page-- are present in the SG as
  * consecutive elements. Also, it handles one entry SG.
  */
+
 static int iser_sg_to_page_vec(struct iser_data_buf *data,
   struct iser_page_vec *page_vec,
   struct ib_device *ibdev)
 {
-   struct scatterlist *sgl = (struct scatterlist *)data->buf;
-   struct scatterlist *sg;
-   u64 first_addr, last_addr, page;
-   int end_aligned;
-   unsigned int cur_page = 0;
+   struct scatterlist *sg, *sgl = (struct scatterlist *)data->buf;
+   u64 start_addr, end_addr, page, chunk_start = 0;
unsigned long total_sz = 0;
-   int i;
+   unsigned int dma_len;
+   int i, new_chunk, cur_page, last_ent = data->dma_nents - 1;

/* compute the offset of first element */
page_vec->offset = (u64) sgl[0].offset & ~MASK_4K;

+   new_chunk = 1;
+   cur_page  = 0;
for_each_sg(sgl, sg, data->dma_nents, i) {
-   unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
-
+   start_addr = ib_sg_dma_address(ibdev, sg);
+   if (new_chunk)
+   chunk_start = start_addr;
+   dma_len = ib_sg_dma_len(ibdev, sg);
+   end_addr = start_addr + dma_len;
total_sz += dma_len;

-   first_addr = ib_sg_dma_address(ibdev, sg);
-   last_addr  = first_addr + dma_len;
-
-   end_aligned   = !(last_addr  & ~MASK_4K);
-
-   /* continue to collect page fragments till aligned or SG ends */
-   while (!end_aligned && (i + 1 < data->dma_nents)) {
-   sg = sg_next(sg);
-   i++;
-   dma_len = ib_sg_dma_len(ibdev, sg);
-   total_sz += dma_len;
-   last_addr = ib_sg_dma_address(ibdev, sg) + dma_len;
-   end_aligned = !(last_addr  & ~MASK_4K);
+   /* collect page fragments until aligned or end of SG list */
+   if (!IS_4K_ALIGNED(end_addr) && i < last_ent) {
+   new_chunk = 0;
+   continue;
}
+   new_chunk = 1;

-   /* handle the 1st page in the 1st DMA element */
-   if (cur_page == 0) {
-   page = first_addr & MASK_4K;
-   page_vec->pages[cur_page] = page;
-   cur_page++;
+   /* address of the first page in the contiguous chunk;
+  masking relevant for the very first SG entry,
+  which might be unaligned */
+   page = chunk_start & MASK_4K;
+   do {
+   page_vec->pages[cur_page++] = page;
page += SIZE_4K;
-   } else
-   page = first_addr;
-
-   for (; page < last_addr; page += SIZE_4K) {
-   page_vec->pages[cur_page] = page;
-   cur_page++;
-   }
-
+   } while (page < end_addr);
}
+
page_vec->data_size = total_sz;
iser_dbg("page_vec->data_size:%d cur_page %d\n", 
page_vec->data_size,cur_page);
return cur_page;
 }

-#define IS_4K_ALIGNED(addr) ((((unsigned long)addr) & ~MASK_4K) == 0)

 /**
  * iser_data_buf_aligned_len - Tries to determine the maximal correctly aligned
@@ -284,42 +276,40 @@ static int iser_sg_to_page_vec(struct is
  * the number of entries which are aligned correctly. Supports the case where
  * consecutive 
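
For readers who want to experiment with the chunk-collection loop from the hunk above outside the kernel, here is a user-space sketch. It is not the driver code: struct sg_ent and sg_to_page_vec() are hypothetical stand-ins for the scatterlist and iser structures, and the DMA addresses are plain integers supplied by the caller.

```c
#include <stdint.h>

#define SIZE_4K	4096ULL
#define MASK_4K	(~(SIZE_4K - 1))
#define IS_4K_ALIGNED(addr)	((((uint64_t)(addr)) & ~MASK_4K) == 0)

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct sg_ent {
	uint64_t dma_addr;
	unsigned int dma_len;
};

/*
 * Walk the entries, merging fragments until a chunk ends on a 4K
 * boundary (or the list ends), then emit the 4K pages covering the
 * chunk. Returns the number of pages written to `pages`; the total
 * byte count is stored in *total_sz.
 */
int sg_to_page_vec(const struct sg_ent *sg, int nents,
		   uint64_t *pages, uint64_t *total_sz)
{
	uint64_t start, end, page, chunk_start = 0;
	int i, new_chunk = 1, cur_page = 0;

	*total_sz = 0;
	for (i = 0; i < nents; i++) {
		start = sg[i].dma_addr;
		if (new_chunk)
			chunk_start = start;
		end = start + sg[i].dma_len;
		*total_sz += sg[i].dma_len;

		/* keep collecting fragments until aligned or end of list */
		if (!IS_4K_ALIGNED(end) && i < nents - 1) {
			new_chunk = 0;
			continue;
		}
		new_chunk = 1;

		/* masking matters for the first entry, which may be unaligned */
		page = chunk_start & MASK_4K;
		do {
			pages[cur_page++] = page;
			page += SIZE_4K;
		} while (page < end);
	}
	return cur_page;
}
```

The caller is assumed to provide a `pages` array large enough for the whole list; the sketch does no bounds checking.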

Re: [PATCH v3] [RFC] rdma/cm: support option to allow manually setting IB path

2009-10-31 Thread Or Gerlitz

Sean Hefty wrote:

> Jason and Or, does this seem ready to queue for 2.6.33?

Roland, I missed your email last week. Anyway, as I wrote Sean
earlier, I'm totally fine with this patch allowing user space to set
a path record for the kernel.


Or.


Re: [PATCH v4] rdma/cm: support option to allow manually setting IB path

2009-11-01 Thread Or Gerlitz

Sean Hefty wrote:

> Future changes to the rdma cm can expand on this framework to support the full
> range of features allowed by the IB CM, such as separate forward and reverse
> paths and APM

Sean,

Before enhancing the rdma-cm to support the full feature set of the IB
CM, something which I personally don't see the actual need for (but I
will be happy to get educated on what applications will or can migrate to
rdma-cm once this is implemented), how about trying to allow for a reduced
QoS scheme also when the entity that resolved this path didn't
consult the SA?


IB QoS is based on the query providing the  
tuple and the SA returning a  QoS tuple. Now 
I'd like to see how we can let the application / querying middleware
take advantage of knowing which partition it runs on and use the SL
associated with the IPv4 (e.g. AF_INET rdma-cm IDs) IPoIB broadcast
group. This way, one can still program a QoS scheme at the SA which is
based on partitions.


Looking at mckey, the user space code (e.g. ACM) could just do rdma_bind
to an IP address of an IPoIB NIC that uses this partition and then
rdma_join to an unmapped multicast address which corresponds to the
broadcast group, take the SL and leave the group. Makes sense?


Or.



[PATCH] librdmacm/mckey: enforce local binding for unmapped multicast addresses

2009-11-01 Thread Or Gerlitz
Enforce that local binding is specified for unmapped multicast addresses;
otherwise mckey crashes when attempting to use the cma_id->verbs pointer
in the port query verb.

Signed-off-by: Or Gerlitz 

Sean, using unmapped multicast addresses I see that a different broadcast
group is created by the SM, such that mckey doesn't manage to join the
IPv4 broadcast group:

$ ./mckey -M ff12:401b::0:0:0:: -b 10.10.5.62 -p 0x2

mckey: joined dgid: ff12:401b::: mlid c00b sl 0

Looking in the SA, I see that the MGID used by the rdma-cm is a bit different
from the one used by IPoIB, since the former sets only the lower 28 bits
where the latter sets the lower 32 bits for this MGID. Any idea what can be
done here?

$ saquery $THIS_NODE_LID

MCMemberRecord group dump:
MGIDff12:401b::::
Mlid0xC000
Mtu.0x84
pkey0x
Rate0x83
SL..0x0


MCMemberRecord group dump:
MGIDff12:401b:::fff:
Mlid0xC00B
Mtu.0x84
pkey0x
Rate0x83
SL..0x0
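
The 28-bit versus 32-bit difference can be reproduced with a small C sketch. This assumes the IPoIB IPv4 MGID layout from RFC 4391 (ff1<scope>:401b:<P_Key> followed by zeros and the low-order address bits); ipv4_mgid() is a hypothetical helper, scope 2 is hard-coded, and the P_Key passed in the usage note is an illustrative value, not one taken from the dump above.

```c
#include <stdint.h>
#include <string.h>

/*
 * Build an IPoIB-style IPv4 MGID. `low` is masked to its lowest `bits`
 * bits and stored in the last 32 bits of the MGID; everything between
 * the P_Key and those bits is zero.
 */
void ipv4_mgid(uint16_t pkey, uint32_t low, unsigned int bits,
	       uint8_t mgid[16])
{
	uint32_t mask = bits >= 32 ? 0xffffffffu : ((1u << bits) - 1);

	low &= mask;
	memset(mgid, 0, 16);
	mgid[0] = 0xff;			/* multicast prefix */
	mgid[1] = 0x12;			/* flags 1, scope 2 (link-local) */
	mgid[2] = 0x40;			/* IPv4 signature 0x401b */
	mgid[3] = 0x1b;
	mgid[4] = pkey >> 8;
	mgid[5] = pkey & 0xff;
	mgid[12] = low >> 24;
	mgid[13] = (low >> 16) & 0xff;
	mgid[14] = (low >> 8) & 0xff;
	mgid[15] = low & 0xff;
}
```

Calling ipv4_mgid(0xffff, 0xffffffff, 32, a) and ipv4_mgid(0xffff, 0xffffffff, 28, b) gives two MGIDs that agree in the first 12 bytes and differ only in byte 12 (0xff versus 0x0f), so from the SA's point of view they are two distinct groups.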


Index: librdmacm/examples/mckey.c
===
--- librdmacm.orig/examples/mckey.c
+++ librdmacm/examples/mckey.c
@@ -273,7 +273,7 @@ static int join_handler(struct cmatest_n
char buf[40];

inet_ntop(AF_INET6, param->ah_attr.grh.dgid.raw, buf, 40);
-   printf("mckey: joined dgid: %s\n", buf);
+   printf("mckey: joined dgid: %s mlid %x sl %d\n", buf, param->ah_attr.dlid, param->ah_attr.sl);

node->remote_qpn = param->qp_num;
node->remote_qkey = param->qkey;
@@ -556,6 +556,11 @@ int main(int argc, char **argv)
}
}

+   if (unmapped_addr && !src_addr) {
+   printf("unmapped multicast address requires binding to source address\n");
+   exit(1);
+   }
+
test.dst_addr = (struct sockaddr *) &test.dst_in;
test.connects_left = connections;



Re: Crash in bonding

2009-11-02 Thread Or Gerlitz

Pradeep Satyanarayana wrote:
> This crash was originally reported against Rhel5.4. However, one can recreate
> this crash quite easily in OFED-1.5 too.

I understand that you get the crash when working with the RHEL 5.4
bonding driver, correct? Does it happen only with IPoIB devices acting
as the bonding slaves, or also with Ethernet devices? Please note that
with RHEL 5.4 there's no need to use the OFED-provided bonding module;
moreover, I believe that the distro-provided one is more stable and
up to date in this case. Moving forward, OFED bonding support for newish
distributions is to be removed. Moni, any reason to support bonding/EL
5.4 in OFED?


Or.


> The steps to recreate the crash are as follows:
> 1. Run traffic (I used ping) on the IB interfaces through the bond master
> 2. ifdown ib0
> 3. ifdown ib1
> 4. modprobe -r ib_ipoib




Re: [PATCH] librdmacm/mckey: enforce local binding for unmapped multicast addresses

2009-11-03 Thread Or Gerlitz

Sean Hefty wrote:

> Unmapped multicast groups only support the case where the SA has created the
> group with the MGID undefined.  The MGID must be in this format: 0xff1 scope
> 0xA01B (see figure 196 on page 928 of the spec).  The kernel checks for this
> specific address format to see if it needs to convert the address or not [...]
> wanted the ability to create a group and get back a unique group ID

I am still not sure I follow you. My basic thought was that unmapped
multicast addresses are MGIDs specified by the application, such that
rdma-cm doesn't treat them as IPv6 multicast addresses and no mapping is
applied to them. From the spec location you pointed me to, I understand
that the intention is for a request to the SA to generate a unique MGID:


1. "if SA receives a request to create a multicast group with the MGID 
undefined"

2.  "the MGID that it creates shall be of the following format"

So there are two parts here: first, request the SA to create a new group and
assign it an MGID (what about joining this node/port to the group?); second,
get back the MGID created by the SA. Looking at the rdma-cm kernel
code, I don't see where/how it tells the SA that the MGID is
undefined. Shouldn't it leave the MGID bit in the component mask unset in
this case? Next, I don't see where the MGID created by the SA is given
back to the application. I guess I am still missing something here; can you
clarify? Thanks.


Or.
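
As an illustration of the address-format check Sean describes, a test on the first 32 bits of the MGID could look like the sketch below. This is not the kernel's code: is_sa_assigned_mgid() is a hypothetical helper, and it assumes exactly the format quoted above (0xff, then 0x1 followed by a scope nibble, then 0xA01B).

```c
#include <stdint.h>

/*
 * Returns 1 if the first 32 bits of `gid` match the SA-assigned MGID
 * format quoted above, 0 otherwise. The scope nibble is ignored.
 */
int is_sa_assigned_mgid(const uint8_t gid[16])
{
	return gid[0] == 0xff &&
	       (gid[1] & 0xf0) == 0x10 &&
	       gid[2] == 0xa0 &&
	       gid[3] == 0x1b;
}
```

Under this check an MGID starting with ff12:a01b is treated as SA-assigned, whereas one starting with ff12:401b (the IPoIB IPv4 signature used in the mckey test earlier in the thread) is not.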



Re: [PATCH 19/25] mlx4: Randomizing mac addresses for slaves

2009-11-04 Thread Or Gerlitz
On Wed, Nov 4, 2009 at 10:04 PM, Roland Dreier  wrote:
>> +#define MLX4_MAC_HEAD               0x2c900ULL

> Is this a good idea?  You're basically choosing 24 random bits within your 
> OUI...
> seems the chance of collision with another MAC used on the same network is
> high enough that it could easily happen in practice on a moderately big 
> network.

Yes, this was brought up by Stephen and others on this list back on
September 11th this year @
http://marc.info/?l=linux-netdev&m=125263488409128

> Can you pick a reserved range or something?

Using a different OUI for the VF devices wouldn't help either, I think,
since the number of VFs becomes fairly big even on a modest-size cluster with
(say) a VM consuming a VF per 1-2 cores.

Or.
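
To put a rough number on Roland's concern, here is a small birthday-problem sketch in C. It assumes a simplified model that is not stated in the thread: every address is drawn uniformly at random from the 2^24 values under the OUI, and collisions with any other MACs already using that OUI are ignored. collision_probability() is a hypothetical helper.

```c
/*
 * Probability that at least two of `n` addresses collide when each is
 * drawn uniformly at random from a space of 2^bits values (bits < 64).
 * Computed as 1 - prod_{i=1}^{n-1} (1 - i / 2^bits).
 */
double collision_probability(unsigned int n, unsigned int bits)
{
	double space = (double)(1ULL << bits);
	double p_unique = 1.0;
	unsigned int i;

	for (i = 1; i < n; i++) {
		if ((double)i >= space)
			return 1.0;	/* more addresses than values */
		p_unique *= 1.0 - (double)i / space;
	}
	return 1.0 - p_unique;
}
```

Under this model, collision_probability(5000, 24) comes out slightly above one half, so a few thousand randomly addressed VFs on one L2 network would already be more likely than not to contain a duplicate.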


Re: [PATCH] librdmacm/mckey: enforce local binding for unmapped multicast addresses

2009-11-07 Thread Or Gerlitz

Sean Hefty wrote:

> I merged this with your other patch to mckey and applied them to my tree

I don't see this @
http://www.openfabrics.org/git/?p=~shefty/librdmacm.git, were you
referring to a local clone?


Or.


Re: QoS in local SA entity

2009-11-07 Thread Or Gerlitz

Sean Hefty wrote:

> I wasn't trying to limit how the SA could 'distribute' QoS information to the
> end nodes.  ACM will obtain QoS information from the SA when it joins its
> multicast groups

Excellent... still, this depends on how the ACM MGIDs are
constructed; I'll take a look at the code.



> ACM is intended to be a service that's used by the librdmacm to resolve address
> mappings and routes.  Trying to have ACM use the librdmacm ends up with a
> circular dependency.  That's the part I'm trying to avoid.

Fair enough. I believe that my suggestion is doable also without a
circular dependency, e.g. as you indicated below or with a fairly small
enhancement of librdmacm, see next.




> ACM uses address mappings as defined in an address configuration file (IP ->
> device, port, pkey).  The address file can be created using the provided
> ib_acme utility, which uses the current system configuration (in an ugly way,
> but it works).  I think this provides QoS behavior similar to what you're
> describing

I assume you are referring to an IP local to the system where ACM runs,
correct? This would work well for applications calling rdma_bind
and/or rdma_resolve_addr while specifying a source address. To
also support applications which do neither of these two, that
is, call rdma_resolve_addr with a dest address only, I suggest enhancing the
librdmacm-calling-ACM flow to resolve the source address using a route
lookup from user space; next, librdmacm can issue rdma_bind on behalf
of this ID, and you have the (device, port, pkey) triplet at hand, so
now the ACM call can be made from librdmacm. Writing this, I realized
that this had better (should) be done also for apps calling rdma_resolve_addr
with a src IP specified. This way you have a unified flow for the ACM use in
librdmacm for any of the apps A, B, C below.


A.1 rdma_bind(src=X)
A.2 rdma_resolve_addr(src=null, dst=Y)

B.1 rdma_resolve_addr(src=null, dst=Y)

C.1 rdma_resolve_addr(src=X, dst=Y)

where librdmacm calling-ACM flow is

L1. compute source address
L2. issue kernel rdma_bind to source address and resolve (device, port, pkey)
L3. issue ACM address (DGID) resolution call using (device, port, pkey) and dest-ip


makes sense? if yes, what's the need in the address configuration file?

Or.
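
Step L1 (computing the source address with a route lookup from user space) can be sketched in C with ordinary sockets. This is only an illustration for IPv4: route_src_addr() is a hypothetical helper, not a librdmacm or ACM function, and the later steps (the rdma_bind_addr call and the ACM query) are not shown.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Ask the kernel which local IPv4 address it would use to reach `dst`.
 * connect() on a UDP socket only performs the route lookup and records
 * the chosen source address; no packet is sent.
 * Returns 0 on success, -1 on error.
 */
int route_src_addr(const struct sockaddr_in *dst, struct sockaddr_in *src)
{
	socklen_t len = sizeof(*src);
	int ret = -1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	if (!connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) &&
	    !getsockname(fd, (struct sockaddr *)src, &len))
		ret = 0;
	close(fd);
	return ret;
}
```

For case B above (rdma_resolve_addr with src=null), the address returned here would be the one handed to the bind step; for cases A and C the application has already supplied it.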


Re: QoS in local SA entity

2009-11-08 Thread Or Gerlitz

Jason Gunthorpe wrote:

> The entire point of the rdma_getaddrinfo + AF_IB is to avoid hacking up
> librdmacm for every address lookup/cache scheme someone invents

The entire simple point I am trying to make is that rdma_getaddrinfo +
AF_INET is doable, is simple and is needed to keep up the essence of the
rdma-cm. I don't see what AF_IB buys anyone, but if you
want to push it, then as long as AF_INET remains first and foremost
supported/interoperable, now and in the future, go and add your bits. As you
indicated, the route lookup I was mentioning could be done in
rdma_getaddrinfo, sure, with &res including both source and destination
addresses. No rdma_resolve_addr2 is needed; the one that exists now already
takes a source address. I don't see what extra info is needed for an
AF_INET address that was resolved with rdma_getaddrinfo; is this AF_IB specific?


I don't see why the app should bother calling rdma_getaddrinfo; it 
can be done by librdmacm, with rdma_getaddrinfo having multiple modules 
as you suggested. I am in favor of the approach suggested by Sean, where 
librdmacm either does its native flow or, under an environment variable, 
does an alternative flow. Your suggestion not to have the 2nd 
flow tightly coupled with ACM, e.g. through using the get_addrinfo 
abstraction and friends, makes sense (yes!)


Or.



Re: [PATCH RESEND] ib/iser: re-write SG handling for rdma logic

2009-11-09 Thread Or Gerlitz



> This patch re-writes the logic that does the above, to make it clearer and
> simpler. It also fixes a bug in the aligned-for-rdma checks, where only an
> "end" check was done and not a "start" check.

Roland, I don't see this patch in your for-next branch; any reason not 
to merge it?


Or.


Re: QoS in local SA entity

2009-11-09 Thread Or Gerlitz

Sean Hefty wrote:

> [...] The current implementation of ACM converts this to:
> ** Source sends a multicast request to destination IP
> ** Destination sends a response with IP to DGID mapping
> - Path record is constructed from multicast group information
> ACM needs to know what the local addresses are, so it can respond to requests
> for those addresses

Okay, got it. Still, how do you see my suggestion on the unified/modified 
librdmacm flow (L1/L2/L3 in my email), which would be taken when working 
against a "DGID/Route" provider such as ACM?
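
The reason ACM must know the local addresses can be shown with a small
model of the responder side. This is a sketch only; struct local_ep and
answer_request are hypothetical names and do not come from ACM. The
point is that only the node owning the requested IP answers with its
IP to DGID mapping, and everyone else stays silent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry: one local IPv4 address and the GID of its port. */
struct local_ep {
	uint32_t ip;		/* IPv4 address, host byte order */
	uint8_t  gid[16];	/* 128-bit port GID */
};

/*
 * Handle a multicast resolution request for dest_ip. If dest_ip is one of
 * our local addresses, copy the matching GID to dgid_out and return 1
 * (send a response). Otherwise return 0 (ignore the request).
 */
static int answer_request(const struct local_ep *tbl, size_t n,
			  uint32_t dest_ip, uint8_t dgid_out[16])
{
	size_t i;

	assert(tbl != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (tbl[i].ip == dest_ip) {
			memcpy(dgid_out, tbl[i].gid, 16);
			return 1;
		}
	}
	return 0;
}
```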


Or.


Re: QoS in local SA entity

2009-11-09 Thread Or Gerlitz

Jason Gunthorpe wrote:

> The extra info in rdma_resolve_addr2 carries the IB specific path information
> from the rdma_getaddrinfo module to the kernel for the address pair. The entire
> purpose of AF_IB is to let user space tell the kernel it does not want a kernel
> side ND and PR query, instead user space will provide all the information.

The kernel patches posted by Sean replace the ND/PR flow with a two-step 
process: first specifying a DGID to the kernel, next specifying a 
PATH. My suggestion is to have a librdmacm-initiated bind before 
sending the DGID to the kernel. This way AF_INET would be supported 
perfectly, under the slight limitation that the source address 
<device, port, pkey> tuple would be chosen by route lookup and not by the 
neigh->dev that was resolved by the kernel ND. This applies only when the 
modified flow of librdmacm is taken (e.g. under user specification with 
an environment variable etc.).


--If-- on top of that you want to add AF_IB, we may be able to do that, 
but I don't see why the whole thing should be made for AF_IB only.



> Think of it this way, ACM takes over the entire process of what AF_INET does in
> the kernel. AF_INET talks directly to the IB CM module in the kernel. Thus, it
> also makes sense that ACM would need to talk to IB CM directly as well. AF_IB
> is that direct connection.


I don't agree we must state it this way. I see ACM as an alternative way 
for AF_INET to resolve ND/PR.


Or.



Re: LID reconfiguration

2009-11-09 Thread Or Gerlitz
> One more question;  I saw librdmacm which looked nice but it does not
> support multi-path connections.  It would eliminate a lot of code if we
> could use this

what are your needs?

Or.


Re: LID reconfiguration

2009-11-09 Thread Or Gerlitz
Jeff Roberson wrote:
> I would want a way to specify the alternate sockaddr with automatic
> failover between them.  Perhaps with some notification when a failover occurred

From your description I still don't see what the alternate address buys you. 


As was suggested here, bond two IPoIB devices and use the address of the bond in 
your librdmacm-based app, and you get automatic HA. You get indications on 
failover through RDMA_CM_EVENT_ADDR_CHANGE, see rdma_get_cm_event(3)

Or.


Re: Crash in bonding

2009-11-10 Thread Or Gerlitz
Pradeep Satyanarayana wrote:

> The crash is specific to IPoIB, and does not happen with Ethernet slaves.

okay

> Can you explain why you plan to remove this from the newer distros? This is 
> indeed news to me

we plan to remove bonding from --ofed-- as the distro provided bonding supports 
ipoib, simple as that, what isn't clear here?

Or.


Re: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM

2009-11-11 Thread Or Gerlitz

Sean Hefty wrote:

> I'll compare my final patches against the ones submitted by David to see if
> anything got missed

Are Jason's patches a superset of David's patches? Or do they need to be 
applied first, and only then can David's work be re-reviewed/merged, etc.?


Or.


[PATCH] librdmacm/mckey: add notifications on events

2009-11-11 Thread Or Gerlitz
add notifications on multicast error and address change events which
can take place while traffic is running.

Signed-off-by: Or Gerlitz 

Index: librdmacm/examples/mckey.c
===
--- librdmacm.orig/examples/mckey.c
+++ librdmacm/examples/mckey.c
@@ -62,6 +62,7 @@ struct cmatest_node {

 struct cmatest {
 	struct rdma_event_channel *channel;
+	pthread_t	cmathread;
 	struct cmatest_node *nodes;
 	int conn_index;
 	int connects_left;
@@ -319,6 +320,30 @@ static int cma_handler(struct rdma_cm_id
 	return ret;
 }

+static void *cma_thread(void *arg)
+{
+	struct rdma_cm_event *event;
+	int ret;
+
+	while (1) {
+		ret = rdma_get_cm_event(test.channel, &event);
+		if (ret) {
+			perror("rdma_get_cm_event");
+			exit(ret);
+		}
+		switch (event->event) {
+		case RDMA_CM_EVENT_MULTICAST_ERROR:
+		case RDMA_CM_EVENT_ADDR_CHANGE:
+			printf("mckey: event: %s, status: %d\n",
+			       rdma_event_str(event->event), event->status);
+			break;
+		default:
+			break;
+		}
+		rdma_ack_cm_event(event);
+	}
+}
+
 static void destroy_node(struct cmatest_node *node)
 {
 	if (!node->cma_id)
@@ -475,6 +500,7 @@ static int run(void)
 	if (ret)
 		goto out;

+	pthread_create(&test.cmathread, NULL, cma_thread, NULL);
 	/*
 	 * Pause to give SM chance to configure switches.  We don't want to
 	 * handle reliability issue in this simple test program.


Re: ipath now and then (was [PATCH] IB/core: export struct ib_port)

2009-11-11 Thread Or Gerlitz
On Wed, Nov 11, 2009 at 11:06 PM, Dave Olson  wrote:
> And yes, the ib_ipath is being fully deprecated.  The "full set" of
> patches that adds ib_qib upstream will include a subset that drops
> ib_ipath.   All the bug fixes and feature work have been done for ib_qib

It was brought up on a few occasions that the ipath driver can be
changed such that it becomes a software IBoE driver (e.g. use a packet
socket with the IBoE ether type for the IB L2 emulation).
If it doesn't have to serve the qlogic HCA anymore, this
transformation might be even easier.
I wonder if it's better to remove it now and maybe return it later with
the new facelift, or leave it till the change is done.

Or.


Re: [PATCH] librdmacm/mckey: add notifications on events

2009-11-12 Thread Or Gerlitz
Sean Hefty wrote:
> mckey is intended to be a fairly simple send/receive multicast test program.
> What's the reasoning behind adding the event handling?

The librdmacm examples serve multiple purposes, among them user education 
on how to write rdmacm-based apps, and as a vehicle to test/validate/reproduce 
features/bugs/issues. For example, a fellow programmer said she wasn't sure 
her application would get a multicast error event when a port goes down; 
with my patch to mckey we were able to see that this event is generated, and we 
can now do better testing. In the future mckey can be further enhanced to 
rejoin, etc. on either of the events. Makes sense?

Or.


Re: ipath now and then (was [PATCH] IB/core: export struct ib_port)

2009-11-12 Thread Or Gerlitz
Ralph Campbell wrote:
> I don't understand what you are suggesting.
> The kernel module name ib_ipath and/or directory name
> drivers/infiniband/hw/ipath could be reused for some
> other purpose certainly.

On second thought, it's better that you go and remove the hw/ipath directory. I 
assume the qib code could be made to serve software iboe in the same manner 
ipath can; just make sure to keep the IB L2 handling in separate files from the 
L3/L4 ones...

Or


Re: [PATCHv2] infiniband-diags/ibqueryerrors: Add support for PortXmitDiscardDetails

2009-11-14 Thread Or Gerlitz

Sasha Khapyorsky wrote:

> I don't think this is the forum to discuss vendor bugs.


There is no way we can commit here a fix for an undocumented bug

Or.



Re: [PATCH RESEND] ib/iser: re-write SG handling for rdma logic

2009-11-14 Thread Or Gerlitz
Roland Dreier wrote:
> I just haven't been in a merging mode lately... will start working on my 
> 2.6.33 queue soon

So when, more or less, is this work going to start? It seems there are a bunch 
of things on the plate for this cycle.

Or.



Re: [PATCH 8/9] ib/addr: simplify resolving IPv4 addresses

2009-11-16 Thread Or Gerlitz
Sean Hefty wrote:
> Merge resolve local/remote address resolution into a single
> data flow to ensure consistent access and use of the local routing tables.

Sean, I reviewed patches 1-6 & 8 and they all look fine, I will give the whole 
series a try later this week to further validate them.

> Based on work from:
> David Wilder 
> Jason Gunthorpe 

David, Jason, are you planning to test these patches as well? specifically I 
assume the IPv6 work should be of interest to you...

Or.


Re: [ewg] [PATCHv6 0/10] RDMAoE support

2009-11-18 Thread Or Gerlitz

Eli Cohen wrote:

> This new series reflects changes based on feedback from the community on the
> previous set of patches, and is tagged v6. Previous series were posted to the
> openfabrics general list only.
>
> Changes from v5:
> 1. Bug fixes.

How do you expect a reviewer to learn what the bugs were, what the fixes 
are, and whether there are known bugs that weren't fixed yet? Is 
one expected to do a diff between patches? Where is the listing of 
changes from vX for X=1,2,3,4?


Or.



Re: [PATCH 8/9] ib/addr: simplify resolving IPv4 addresses

2009-11-19 Thread Or Gerlitz
> I reviewed patches 1-6 & 8 and they all look fine, I will give the whole
> series a try later this week to further validate them

I tested the patch series (V2 for the patches that have it, V1 for the rest) 
over 2.6.32-rc5 and librdmacm-1.0.8-1.el5, covering AF_INET/PS_TCP unicast, 
AF_INET/PS_IPOIB multicast, and bonding (operability and address-change event). 
I used mckey and rping; all worked fine. Thanks for driving this change set, 
Sean. David, I'll be happy to hear how the IPv6 testing went; let's get this 
going.



Or.



Re: [PATCH 7/9] rdma/cm: fix loopback address support

2009-11-24 Thread Or Gerlitz
Sean Hefty  wrote:

> I will create a new librdmacm package that corresponds with the changes

I made all my testing of the patch set with librdmacm 1.0.10 and
patched 2.6.32-rc5 kernel, where as I wrote you, I was focusing on
AF_INET/PS_TCP and AF_INET/PS_IPOIB.
I understand that Dave was covering AF_INET6/PS_TCP with plenty of the
ipv6 variations.

So what will this new librdmacm package let us cover which wasn't
possible so far? Do you refer to ipv6 support in mckey? Anything else?

Or.


Re: [PATCH 7/9] rdma/cm: fix loopback address support

2009-11-24 Thread Or Gerlitz
> Changes were your changes to mckey, plus changes Dave added to cmatose to
> support IPv6.  The actual library itself hasn't been modified.

okay, got it. I was under the impression that mckey still misses an
option to get from the user an ipv6 multicast address which isn't all
zeros nor unmapped, correct? or the -m option will work with both ipv4
and ipv6 addresses?

Or.


Re: RDMAoE verbs questions

2009-11-24 Thread Or Gerlitz

Jeff Squyres wrote:

> I was reviewing Mellanox's Open MPI patches for RDMAoE support

Hi Jeff,

Can you send us a pointer to the patch series (mail thread or some 
repository where they sit)?


> 1. It looks like there is a new field on the ibv_port_attr struct:
> transport. Is it expected that all device drivers will start filling
> in this value, or is it done in the OF core code somewhere?

Please note that this field isn't present in the distro-provided IB 
stack, and hence it is highly recommended to avoid referring to it in your 
code. At least some of us (...) are for decoupling ompi from ofed, so 
let's not put sticks in the wheels of that process.


> the Open MPI RDMAOE patch implies that host loopback is not supported
> in RDMAOE mode (but it is in IB mode).  To be clear, the OMPI code had
> to do something different for real IB vs. RDMAOE in at least 1 or 2 places

Liran, where does this limitation come from? Doesn't the HCA support 
bridging (loopback connections) for RDMAoE? If the limitation is real, maybe 
you should add a device capability to mark that.


Or.



Re: RDMAoE verbs questions

2009-11-25 Thread Or Gerlitz

Jeff Squyres wrote:
> Here's one thread:
> http://www.open-mpi.org/community/lists/devel/2009/11/7063.php

Jeff, looking at the threads you have sent, I didn't find a way to 
download the patch in a form which can be applied to a source tree. Is 
there a way to do it through this archive? Are these patches available 
from some git tree @mellanox or elsewhere? Does anyone have the email 
address of Vasily Philipov (/vasily_at_[hidden]/)? If yes, can you or 
Pasha please ask him to send the proposed patch to me or, better, to this 
list. Many thanks.


Or



Re: RDMAoE verbs questions

2009-11-25 Thread Or Gerlitz

Pavel Shamis (Pasha) wrote:

> The patch is attached

Thanks. This patch basically replaces checks that the device transport 
type is IB with a check that either that holds or 
the port transport type is rdmaoe. As Jason and Tziporet noted, the 
port transport type seems to be a bad and non-compatible/interoperable idea, 
so it should and probably could be avoided.


I see another patch @ 
http://www.open-mpi.org/community/lists/devel/2009/11/7063.php
can you send that one as well? The patch you sent isn't signed, so I 
can't address the author in further replies (unless you are the author). 
Also, it wasn't generated with the -p option of diff, which would show for 
each change what the affected function is; doing so would help in the 
review.


Or.



Re: RDMAoE verbs questions

2009-11-26 Thread Or Gerlitz
Pavel Shamis (Pasha) wrote:
> The only reason for this changes is the fact that for IB devices we
> prefer to use our own open mpi connection managers. In case if we will
> decide to use RDMA-CM for all devices the number of changes will be zero...

whatever, currently, this change is still there, and best if you remove it 
and find another way to set this predicate.

> So we decided to use the current ompi code as is, in future maybe we will
> implement own ompi rdmacm code that will not have all this work around flows.

just to make sure I am with you: all in all, only one patch is proposed to ompi 
for rdmaoe support, and it is the patch which we discuss above. This patch does 
three things:

1. changes BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB to look at the port 
transport type
2. if the port transport is rdmaoe, doesn't run loopback connections on IB
3. some change in the qp destroy logic
4. that's it...

correct? can you comment on #2? why loopback connections aren't supported?

Or.


Re: Reliable IB connections (RC) and event ordering

2009-12-01 Thread Or Gerlitz
Roland Dreier wrote:
> The IBA takes into account this lack of ordering in multiple places --
> defining "communication established" async events, etc.

same goes for the IB stack... e.g. take a look at the ib_cm_notify and 
rdma_notify APIs

Or.


Re: RDMAoE verbs questions

2009-12-02 Thread Or Gerlitz

Liran Liss wrote:

> from an rdmacm app's point of view - there is no visible difference between IB
> and RDMAoE ports: both support the complete set of Verbs, just as any IB
> transport provider

Wrong, local (loopback) communication isn't supported with RDMAoE.

Or.


Re: RDMAoE verbs questions

2009-12-02 Thread Or Gerlitz

Paul Grun wrote:
> Why do you say that Or?

I said that because the latest patch set posted by Mellanox doesn't support 
loopback. I hear now that this was a temporary limitation which will be 
removed; let it be.


Or.


Re: QoS settings not mapped correctly per pkey ?

2009-12-03 Thread Or Gerlitz
Yevgeny Kliteynik wrote:
> " It looks like in 'datagram' mode, the SL weights
>   do not seem to be applied, or maybe this is an
>   artifact of IPoIB in 'datagram mode' "

yes, there's no reason for connected mode to behave differently wrt QoS/SL 
assignment from the SM, as both modes get their SL from the path record 
provided by the SM and both modes use the same code for the path query...

> Have you checked that in this mode you do get the right
> SL for each child interface by shutting off the relevant
> SL (mapping it to VL15)?

Seeing what SL is provided by the SM in response to the path query is trivial, 
either through the opensm logs or the ipoib ones. E.g. here you see that ib1 
got SL 0 on its path to GID fe80:0000:0000:0000:0008:f104:0399:3c92, 
LID 0x0006, which is 10.10.0.91

> ifdown ib1
> echo 1 > /sys/module/ib_ipoib/parameters/debug_level
> ifup ib1
> ping 10.10.0.91
> dmesg | grep ib1

> ib1: Start path record lookup for fe80:0000:0000:0000:0008:f104:0399:3c92 MTU > 0
> ib1: PathRec LID 0x0006 for GID fe80:0000:0000:0000:0008:f104:0399:3c92
> ib1: Created ah 81021ddda180
> ib1: created address handle 81021ddda500 for LID 0x0006, SL 0
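
When checking many child interfaces, the SL can be pulled out of the
"created address handle" line mechanically. A minimal sketch, assuming
the log format is exactly as in the dmesg excerpt above; the helper name
parse_ah_line is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract LID and SL from an IPoIB debug line of the form
 *   "ib1: created address handle <ptr> for LID 0x0006, SL 0"
 * Returns 0 on success, -1 if the line does not match.
 */
static int parse_ah_line(const char *line, unsigned *lid, unsigned *sl)
{
	const char *p;

	assert(line && lid && sl);

	p = strstr(line, "for LID ");
	if (!p)
		return -1;
	/* %x accepts the optional 0x prefix printed by the driver. */
	if (sscanf(p, "for LID %x, SL %u", lid, sl) != 2)
		return -1;
	return 0;
}
```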

> # ip neigh show dev ib1
> 10.10.0.91 lladdr 80:00:00:49:fe:80:00:00:00:00:00:00:00:08:f1:04:03:99:3c:92 REACHABLE
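
The 20-byte lladdr shown by ip neigh already contains the GID used in
the path record lookup: per the IPoIB link-layer address layout
(RFC 4391) it is 1 byte of flags, a 3-byte QPN, then the 16-byte GID.
A small sketch that splits it; the helper name parse_ipoib_lladdr is
hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define IPOIB_HWADDR_LEN 20

/*
 * Parse a colon-separated 20-byte IPoIB link-layer address as printed by
 * "ip neigh" and split it into the 24-bit QPN and the 16-byte GID.
 * Returns 0 on success, -1 on a malformed string.
 */
static int parse_ipoib_lladdr(const char *s, unsigned *qpn,
			      unsigned char gid[16])
{
	unsigned char hw[IPOIB_HWADDR_LEN];
	int i, n;

	assert(s && qpn && gid);

	for (i = 0; i < IPOIB_HWADDR_LEN; i++) {
		unsigned byte;

		/* Each byte must be exactly two hex digits. */
		if (sscanf(s, "%2x%n", &byte, &n) != 1 || n != 2)
			return -1;
		hw[i] = (unsigned char)byte;
		s += 2;
		if (i < IPOIB_HWADDR_LEN - 1) {
			if (*s != ':')
				return -1;
			s++;
		}
	}

	/* Byte 0 is flags, bytes 1..3 the QPN, bytes 4..19 the GID. */
	*qpn = ((unsigned)hw[1] << 16) | ((unsigned)hw[2] << 8) | hw[3];
	for (i = 0; i < 16; i++)
		gid[i] = hw[4 + i];
	return 0;
}
```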



Re: InfiniBand/RDMA merge plans

2009-12-08 Thread Or Gerlitz
Roland Dreier wrote:
> Since 2.6.31-rc8 has been out more than a week already, it's probably
> a good time to talk about 2.6.32 merge plans.  All the pending things
> that I'm aware of are listed below.

Hi Roland, any update on the 2.6.33 merge plans?

Or.


Re: [PATCH 06/11] RDMA/nes: abnormal listener termination causes loopback node crash

2009-12-09 Thread Or Gerlitz

Faisal Latif wrote:

> when listener is destroyed for loopback connection

Does the upstream iwarp stack support loopback connections? Does it 
apply to all vendors?


Or.


Re: InfiniBand/RDMA merge plans for 2.6.33

2009-12-15 Thread Or Gerlitz

Eli Cohen wrote:

>>  - IBoE.  In principle I think this is starting to get there.  Still
>>    want to see better ABI compatibility at least, and also make sure
>>    the interface chosen works for both rdmacm and non-rdmacm applications.
>
> Based on this, I am going to send a new patch set, a few days after 2.6.33-rc1 is out

Eli, here are some more issues which should be on the table and which you 
might want to look at before posting a new version of the patches (or, 
if you want to handle them down the road of the review process, 
that's fine)


- loopback support: Liran commented that this works; does this mean 
only a firmware fix is needed?


- below-the-cover-addr-resolve-in-create-AH flow races e.g 
https://bugs.openfabrics.org/show_bug.cgi?id=1866


- L2 Ethernet integration for rdma-cm based apps, namely at minimum have 
the  gang to comply 
with packets sent by the network stack for the same IP route.


Or.


Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)

2009-12-23 Thread Or Gerlitz

Liran Liss wrote:
>> all the rdmaoe materials say the lossless traffic class is a must; are
>> you saying that this works well also without it? Then why, from an
>> architecture point of view, have you posed this requirement?
>
> lossless traffic can be achieved today using global pause, for
> example.  PFC is still important; we will submit initial patches that
> support it next week

Liran, I would say that on the one hand global pause isn't the way to go, 
and on the other hand IB RC functions quite badly when many packets are 
lost. As such, RDMAoE without PFC and mapping priorities into TCs (the 
Ethernet VLs) isn't really for production, for any non-trivial 
environment involving more than one hop. Also, this email is from one 
month ago; any news on the patches?


Yevgeny, I took a look, and there are patches to support pfc for the 
mlx4_en driver, but they were never submitted upstream, which means that 
even if rdmaoe goes upstream, mainline users will not even be able to 
really test it. Also, the pfc configuration in these patches seems to 
be done with sysfs and not through the Netlink APIs defined in 
include/net/dcbnl.h; did you have any specific reason not to integrate 
with the mainline method of pfc/tc configuration?


Or.


[PATCH] IB/mlx4: fix post_recv wq overflow check

2009-12-23 Thread Or Gerlitz
the post recv flow should check wq overflow using the recv and not the send cq

Signed-off-by: Or Gerlitz 

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 989555c..2a97c96 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1752,7 +1752,7 @@ int mlx4_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 	ind = qp->rq.head & (qp->rq.wqe_cnt - 1);

 	for (nreq = 0; wr; ++nreq, wr = wr->next) {
-		if (mlx4_wq_overflow(&qp->rq, nreq, qp->ibqp.send_cq)) {
+		if (mlx4_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) {
 			err = -ENOMEM;
 			*bad_wr = wr;
 			goto out;


Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)

2009-12-23 Thread Or Gerlitz
Roland Dreier  wrote:

> I agree that implementing DCB is important for IBoE, but why do you say
> that a classical ethernet fabric with global pause isn't usable?  That
> should be roughly equivalent to an IB fabric that uses only a single VL,
> which is the case for many production IB fabrics.

To start with, no matter how many data VLs are used (e.g one), all the
crucial management traffic (SMPs) go on VL15 which is on the one hand
lossy and on the other hand not subject to congestion when other VLs
are. Now how would you manage your Cisco switch --remotely-- on a
globally paused fabric when some multicast receiver hasn't had its
breakfast and now slows the sender while filling the queues throughout
the congestion tree where this switch is part of?

To continue with, lossless is good, but to make your cluster usable
under congestion, you need congestion control, that is QCN, which is
designed/optimized to the case of multiple TCs.

Also, IBoE can potentially find its way into much more complex
environments than IB has, specifically into clusters whose hosts
act as hypervisors running many, many VMs and whose underlying fabric
consolidates many types of traffic. Globally pausing a port can
dramatically reduce the efficiency of such a computing center, which
was probably built to increase efficiency in the first place.

I believe that the ixgbe team understands that well, and hence their
continued DCB efforts can make the combination of RXE with
Niantic/ixgbe very interesting to test.

Or.


Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)

2009-12-23 Thread Or Gerlitz
Paul Grun  wrote:
> there doesn't appear to be an argument in favor of requiring DCB with RoCEE

Interesting, the ofa server is down now, so I don't have access to ofa
IBoE materials, from my memory I recall that in ALL of them you have
made the IBoE/CEE bundling very clear & evident, e.g this  IBTA
presentation made to T11 @
http://www.t11.org/ftp/t11/pub/fc/study/09-543v0.pdf

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)

2009-12-24 Thread Or Gerlitz

Roland Dreier wrote:

> Sure, DCB is very useful, in many environments. And maybe even a requirement
> sometimes.  I'm simply trying to say that IBoE with classical ethernet is at
> least as useful as standard IB in many cases

Roland, Paul,

Putting aside for a moment the detailed discussion we've started and 
looking at the concluding remarks you have made, I am not sure I 
follow: if DCB isn't available (even for a silly reason such as hw 
supporting pfc but patches not being pushed to the kernel...), what do you 
think would function better (or function at all) for IBoE, lossy or 
globally paused Ethernet? I haven't managed so far to convince you that 
neither is applicable for IBoE, but I also didn't manage to see what 
you are suggesting in the absence of DCB.


Or.




Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)

2009-12-24 Thread Or Gerlitz

Liran Liss wrote:

> I second...

Fair enough, so now (A) everyone agrees that DCB is good for IBoE and 
(B) mlx4 supports pfc. Any reason not to push the pfc patches into the 
kernel and have mlx4_en comply with the mainline dcbnl code?

> The only way an end-node can cause congestion is if its internal buses don't
> match the IB link's BW, but this is unrelated to (lack of) transport-level flow
> control.

Thanks for clarifying this

Or.


Re: [PATCH] IB/mlx4: fix post_recv wq overflow check

2010-01-06 Thread Or Gerlitz
Roland Dreier wrote:
> thanks, applied.

Since this is not a regression, I see that it went into your for-next branch,
so I assume it will be available in 2.6.34. Are you fine with the patch
going into the -stable series?

Or.


Re: [PATCH] IB/mlx4: fix post_recv wq overflow check

2010-01-07 Thread Or Gerlitz

Roland Dreier wrote:

> Actually I was planning on sending it for 2.6.33, since it's so small and
> obvious and we're reasonably early in the cycle.  Not sure about -stable though
> -- has this been hit in practice?

I agree that it should go into 2.6.33; since it's so small there's no
reason to wait for 2.6.34. As for the "being hit" question: note that
without the fix there is both a bug in the overflow check and extra
contention between the post recv and poll send CQ flows, for ULPs whose
send CQ is different from their recv CQ, e.g. IPoIB. I came across
this bug while reviewing the mlx4 posting code during some profiling.

I wonder if the overflow check could be removed altogether and left
to the ULP (the kernel is a trusted environment...). Is there any risk in
doing so? This way the WR posting code would not experience contention with
the poll WC code on the CQ lock.


Or.
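
For readers following the overflow-check discussion, here is a minimal userspace sketch of the kind of check being debated. It is not the mlx4 driver code: `wq_model`, `cq_model` and `wq_would_overflow` are hypothetical names, and the CQ lock is replaced by a counter so the slow path is visible. The CQ is passed as a parameter to make explicit that the caller has to hand in the CQ that actually completes the queue being checked, which is the point of contention raised above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a driver work queue. */
struct wq_model {
	unsigned head;      /* incremented by the posting path */
	unsigned tail;      /* incremented when completions are polled */
	unsigned max_post;  /* maximum number of outstanding work requests */
};

/* Hypothetical stand-in for the CQ whose polling advances wq->tail. */
struct cq_model {
	unsigned lock_taken; /* counts how often the slow path "took the lock" */
};

/*
 * Returns true if posting nreq more requests would overflow the queue.
 * Unsigned subtraction keeps head - tail correct across wraparound.
 */
static bool wq_would_overflow(const struct wq_model *wq, unsigned nreq,
			      struct cq_model *cq)
{
	unsigned cur;

	assert(wq->max_post > 0);

	cur = wq->head - wq->tail;
	if (cur + nreq < wq->max_post)
		return false;   /* fast path: no lock needed */

	/*
	 * Slow path: a real driver would take the CQ lock here and re-read
	 * head/tail so that a concurrent poll is accounted for. This model
	 * only counts the event.
	 */
	cq->lock_taken++;
	cur = wq->head - wq->tail;
	return cur + nreq >= wq->max_post;
}
```

In this model the lock is only touched when the queue looks full, so a ULP that never overfills its queue stays on the fast path.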


Re: [PATCHv7 4/9] ib_core: RoCEE CMA device binding

2010-01-07 Thread Or Gerlitz
Eli Cohen wrote:
> +static int cma_resolve_rocee_route(struct rdma_id_private *id_priv)
[...]
> + route->path_rec->hop_limit = 2;

why? does this value have any specific meaning?

> + route->path_rec->mtu_selector = 2;

all the xxx_selector assignments in this code should use the values of
the selector enum from ib_sa.h rather than raw numbers.

Or.
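
A small sketch of what that suggestion could look like. The enum values mirror `enum ib_sa_selector` from include/rdma/ib_sa.h; `path_rec_model` and `path_rec_set_exact_mtu` are hypothetical, trimmed-down stand-ins for illustration and are not part of the patch under review.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors enum ib_sa_selector from include/rdma/ib_sa.h. */
enum ib_sa_selector {
	IB_SA_GT   = 0,
	IB_SA_LT   = 1,
	IB_SA_EQ   = 2,
	IB_SA_BEST = 3
};

/* Hypothetical path record holding only the fields used here. */
struct path_rec_model {
	uint8_t mtu_selector;
	uint8_t mtu;
};

/* Hypothetical helper: request an exact MTU using the named selector. */
static void path_rec_set_exact_mtu(struct path_rec_model *rec, uint8_t mtu)
{
	assert(mtu >= 1 && mtu <= 5);   /* IB MTU codes: 1 = 256 ... 5 = 4096 */
	rec->mtu_selector = IB_SA_EQ;   /* instead of the bare literal 2 */
	rec->mtu = mtu;
}
```

The stored value is the same as the literal 2 in the quoted hunk; only the intent becomes readable.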


[PATCH] ib/ipoib: remove TX moderation from the ethtool related code

2010-01-11 Thread Or Gerlitz
As of commit f56bcd8 "IPoIB: Use separate CQ for UD send completions",
there are no TX interrupts in the main code path. Change the ethtool
related code to comply with this, so that users will not be misled
into assuming they can control TX interrupt moderation. Pointed out by
Alex Vainman

Signed-off-by: Or Gerlitz 

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index e9795f6..d10b4ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -55,9 +55,7 @@ static int ipoib_get_coalesce(struct net_device *dev,
struct ipoib_dev_priv *priv = netdev_priv(dev);

coal->rx_coalesce_usecs = priv->ethtool.coalesce_usecs;
-   coal->tx_coalesce_usecs = priv->ethtool.coalesce_usecs;
coal->rx_max_coalesced_frames = priv->ethtool.max_coalesced_frames;
-   coal->tx_max_coalesced_frames = priv->ethtool.max_coalesced_frames;

return 0;
 }
@@ -69,10 +67,8 @@ static int ipoib_set_coalesce(struct net_device *dev,
int ret;

/*
-* Since IPoIB uses a single CQ for both rx and tx, we assume
-* that rx params dictate the configuration.  These values are
-* saved in the private data and returned when ipoib_get_coalesce()
-* is called.
+* These values are saved in the private data and returned
+* when ipoib_get_coalesce() is called
 */
if (coal->rx_coalesce_usecs   > 0x ||
coal->rx_max_coalesced_frames > 0x)
@@ -85,8 +81,6 @@ static int ipoib_set_coalesce(struct net_device *dev,
return ret;
}

-   coal->tx_coalesce_usecs   = coal->rx_coalesce_usecs;
-   coal->tx_max_coalesced_frames = coal->rx_max_coalesced_frames;
priv->ethtool.coalesce_usecs   = coal->rx_coalesce_usecs;
priv->ethtool.max_coalesced_frames = coal->rx_max_coalesced_frames;



upstream mlx4/ib/4K mtu support

2010-01-11 Thread Or Gerlitz
Hi Vlad, I came across this ofed patch which isn't upstream. Is it required
to make mlx4/IB 4K MTU work? Was it rejected upstream, and if so, why?

Or.


mlx4/IB: Add set_4k_mtu module parameter.

It controls the InfiniBand link MTU for all IB ports in a host.

Signed-off-by: Vladimir Sokolovsky 
---
Index: ofed_kernel-fixes/drivers/net/mlx4/port.c
===
--- ofed_kernel-fixes.orig/drivers/net/mlx4/port.c  2009-11-09 02:20:06.0 +0200
+++ ofed_kernel-fixes/drivers/net/mlx4/port.c   2009-11-09 02:21:46.0 +0200
@@ -37,6 +37,10 @@

 #include "mlx4.h"

+int mlx4_ib_set_4k_mtu = 0;
+module_param_named(set_4k_mtu, mlx4_ib_set_4k_mtu, int, 0444);
+MODULE_PARM_DESC(set_4k_mtu, "attempt to set 4K MTU to all ConnectX ports");
+
 #define MLX4_MAC_VALID (1ull << 63)
 #define MLX4_MAC_MASK  0xULL

@@ -308,6 +312,9 @@

memset(mailbox->buf, 0, 256);

+   if (mlx4_ib_set_4k_mtu)
+   ((__be32 *) mailbox->buf)[0] |= cpu_to_be32((1 << 22) | (1 << 21) | (5 << 12) | (2 << 4));
+
((__be32 *) mailbox->buf)[1] = dev->caps.ib_port_def_cap[port];
err = mlx4_cmd(dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT,
   MLX4_CMD_TIME_CLASS_B);
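
To make the magic constant in the hunk above easier to read, the following sketch works out its value. Only the arithmetic is taken from the patch; what the firmware does with each field is not stated in the thread, so any interpretation is an assumption. One observation that can be made safely is that 5 is the IB MTU code for 4096 bytes, which presumably is what `(5 << 12)` encodes. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* IB MTU codes as in the verbs MTU enum: 1 = 256 bytes ... 5 = 4096 bytes. */
static int ib_mtu_code_to_bytes(unsigned code)
{
	assert(code >= 1 && code <= 5);
	return 128 << code;
}

/*
 * The value OR-ed into the first 32-bit word of the SET_PORT mailbox by
 * the patch. Field meanings are an assumption; the arithmetic is not.
 */
static uint32_t set_4k_mtu_word(void)
{
	return (1u << 22) | (1u << 21) | (5u << 12) | (2u << 4);
}

/* Portable equivalent of cpu_to_be32() that writes the bytes explicitly. */
static void put_be32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v >> 24);
	out[1] = (uint8_t)(v >> 16);
	out[2] = (uint8_t)(v >> 8);
	out[3] = (uint8_t)v;
}
```

The word evaluates to 0x00605020, so the mailbox bytes written are 00 60 50 20 regardless of host byte order.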


Re: RDMA Read sge errors

2010-01-11 Thread Or Gerlitz
Jack, I see now that commit cd155c1 "IB/mlx4: Fix creation of kernel QP with
max number of send s/g entries" is in mainline but not in ofed 1.4.x, and that
mlx4_0090_fix_sq_wrs.patch (below) is in ofed but not in mainline. Was it
rejected from the mainline kernel? If so, why?

Or.


1. Limit qp resources accepted for ib_create_qp() to the limits reported
   in ib_query_device(). In kernel space, make sure that the limits
   returned to the caller following qp creation also lie within the
   reported device limits. For userspace, report as before, and
   do adjustment in libmlx4 (so as not to break ABI).

2. Limit max number of wqes per QP reported when querying the device,
   so that ib_create_qp will never fail due to any additional headroom WQEs
   allocated.

Signed-off-by: Jack Morgenstein 

---
 drivers/infiniband/hw/mlx4/main.c    |    2 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |    7 +++
 drivers/infiniband/hw/mlx4/qp.c      |   25 +++--
 3 files changed, 27 insertions(+), 7 deletions(-)

Index: ofed_kernel/drivers/infiniband/hw/mlx4/main.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/main.c
+++ ofed_kernel/drivers/infiniband/hw/mlx4/main.c
@@ -122,7 +122,7 @@ static int mlx4_ib_query_device(struct i
props->max_mr_size = ~0ull;
props->page_size_cap   = dev->dev->caps.page_size_cap;
props->max_qp  = dev->dev->caps.num_qps - dev->dev->caps.reserved_qps;
-   props->max_qp_wr   = dev->dev->caps.max_wqes;
+   props->max_qp_wr   = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE;
props->max_sge = min(dev->dev->caps.max_sq_sg,
 dev->dev->caps.max_rq_sg);
props->max_cq  = dev->dev->caps.num_cqs - dev->dev->caps.reserved_cqs;
Index: ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -44,6 +44,13 @@
 #include 
 #include 
 
+enum {
+   MLX4_IB_SQ_MIN_WQE_SHIFT = 6
+};
+
+#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
+#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))
+
 struct mlx4_ib_ucontext {
struct ib_ucontext  ibucontext;
struct mlx4_uar uar;
Index: ofed_kernel/drivers/infiniband/hw/mlx4/qp.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/qp.c
+++ ofed_kernel/drivers/infiniband/hw/mlx4/qp.c
@@ -289,8 +289,9 @@ static int set_rq_size(struct mlx4_ib_de
   int is_user, int has_srq, struct mlx4_ib_qp *qp)
 {
/* Sanity check RQ size before proceeding */
-   if (cap->max_recv_wr  > dev->dev->caps.max_wqes  ||
-   cap->max_recv_sge > dev->dev->caps.max_rq_sg)
+   if (cap->max_recv_wr > dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE ||
+   cap->max_recv_sge >
+   min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg))
return -EINVAL;
 
if (has_srq) {
@@ -309,8 +310,19 @@ static int set_rq_size(struct mlx4_ib_de
qp->rq.wqe_shift = ilog2(qp->rq.max_gs * sizeof (struct mlx4_wqe_data_seg));
}
 
-   cap->max_recv_wr  = qp->rq.max_post = qp->rq.wqe_cnt;
-   cap->max_recv_sge = qp->rq.max_gs;
+   /* leave userspace return values as they were, so as not to break ABI */
+   if (is_user) {
+   cap->max_recv_wr  = qp->rq.max_post = qp->rq.wqe_cnt;
+   cap->max_recv_sge = qp->rq.max_gs;
+   } else {
+   cap->max_recv_wr  = qp->rq.max_post =
+   min(dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE, qp->rq.wqe_cnt);
+   cap->max_recv_sge = min(qp->rq.max_gs,
+   min(dev->dev->caps.max_sq_sg,
+   dev->dev->caps.max_rq_sg));
+   }
+   /* We don't support inline sends for kernel QPs (yet) */
+
 
return 0;
 }
@@ -321,8 +333,9 @@ static int set_kernel_sq_size(struct mlx
int s;
 
/* Sanity check SQ size before proceeding */
-   if (cap->max_send_wr > dev->dev->caps.max_wqes  ||
-   cap->max_send_sge> dev->dev->caps.max_sq_sg ||
+   if (cap->max_send_wr > (dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE) ||
+   cap->max_send_sge>
+   min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg) ||
cap->max_inline_data + send_wqe_overhead(type, qp->flags) +
sizeof (struct mlx4_wqe_inline_seg) > dev->dev->caps.max_sq_desc_sz)
return -EINVAL;
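
As a worked example of the headroom arithmetic in the patch above: the two macros and the enum are copied from the mlx4_ib.h hunk, while `reported_max_qp_wr` and the device limit used in the usage note are hypothetical and only serve to show the numbers.

```c
#include <assert.h>

/* Copied from the mlx4_ib.h hunk of the patch. */
enum { MLX4_IB_SQ_MIN_WQE_SHIFT = 6 };

#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))

/* Hypothetical helper: the max_qp_wr value the patched query would report. */
static int reported_max_qp_wr(int max_wqes)
{
	assert(max_wqes > MLX4_IB_SQ_MAX_SPARE);
	return max_wqes - MLX4_IB_SQ_MAX_SPARE;
}
```

With the minimum WQE shift of 6, the spare count is (2048 >> 6) + 1 = 33, so a device with a made-up limit of 16384 WQEs would advertise 16351 through ib_query_device().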

Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space

2010-01-14 Thread Or Gerlitz
sebastien dugue wrote:
> That can be done with port numbers, except that we cannot separate
> traffic to Lustre MDS and traffic to Lustre OSS 

Looking at these patches, and going along with you for a minute, I don't see
how this patch set helps you assign a different QoS level (e.g. SL) to MDS vs.
OSS related traffic. Can you elaborate on that a bit?

Sean Hefty wrote:
> Can't this be done using port numbers in the existing port space?

Indeed. Sebastien, what prevents you from using the TCP port space, with one
port used for MDS traffic and another port for OSS traffic? How does Lustre get
the ports it listens on: are they well known, or do you call bind with port
zero and use the port allocated by the rdma-cm?

Or.


Re: [PATCHv7 7/9] ib_core: Add API to support RoCEE from userspace

2010-01-17 Thread Or Gerlitz
Eli Cohen wrote:
> Add ib_uverbs_get_mac() to be used by ibv_create_ah() to retrieve the remote
> port's MAC address from the remote port's GID. Port link layer is also
> returned by ibv_query_port()

why can't all this be implemented within libibverbs? Looking at mlx4's
implementation of ib_get_mac, it reduces to calling rdma_get_ll_mac, a
two-line inline function which does the translation.

Or.
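
To illustrate why such a translation can live in user space, here is a sketch of a GID-to-MAC conversion. It assumes the low 8 bytes of the GID are a modified EUI-64 built from the MAC, as in IPv6 link-local addresses; whether rdma_get_ll_mac in the patch series does exactly this is not shown in the thread. `gid_to_mac` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, assuming the interface ID in gid[8..15] is a
 * modified EUI-64: ff:fe inserted in the middle of the MAC and the
 * universal/local bit of the first byte flipped.
 */
static void gid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
	assert(gid[11] == 0xff && gid[12] == 0xfe);
	memcpy(mac, &gid[8], 3);        /* first three MAC bytes */
	memcpy(mac + 3, &gid[13], 3);   /* last three MAC bytes */
	mac[0] ^= 2;                    /* undo the universal/local bit flip */
}
```

Under that assumption, the GID fe80::0202:c9ff:fe01:0203 maps to MAC 00:02:c9:01:02:03 with no kernel involvement at all.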


Re: [PATCHv7 4/9] ib_core: RoCEE CMA device binding

2010-01-18 Thread Or Gerlitz
Eli Cohen wrote:
> The other place is IPoIB:path_rec_completion() where we need not require
> GRH since IPoIB over RoCEE is disabled

please note that you can't assume that IPoIB need not use GRH, as at some
future point this code may operate across IB subnets. For a couple of years
now, patches that pave the way for supporting this have been merged into the
code, e.g. see 46f1b3d7 "IB/ipoib: Use ib_init_ah_from_path to initialize
ah_attr"

Or.


clarification on the mlx4 CQE structure

2010-01-19 Thread Or Gerlitz
Hi Yevgeny, looking at commit f780a9f "mlx4_core: Add ethernet fields
to CQE struct" I see the following two changes:

@@ -692,14 +692,13 @@ repoll:
-   wc->sl = cqe->sl >> 4;
+   wc->sl = be16_to_cpu(cqe->sl_vid >> 12);

I wasn't sure if/why a conversion from network order to host order is
needed here, can you clarify that?

Or.


@@ -39,17 +39,18 @@
 struct mlx4_cqe {
-   __be32  my_qpn;
+   __be32  vlan_my_qpn;
-   u8  sl;
-   u8  reserved1;
+   __be16  sl_vid;




Re: clarification on the mlx4 CQE structure

2010-01-19 Thread Or Gerlitz
Yevgeny Petrilin wrote:
> This commit has an endianness bug, that was fixed in commit f781a22f.
> The cqe->sl_vid field is a be16, so we needed to convert the sl value to
> host order. Before the commit this field was two u8 fields, so no conversion
> was needed

okay, got it, thanks 

Or.
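
The order-of-operations issue in this exchange can be shown with a small portable sketch. It assumes, based on the `>> 12` in the quoted hunk, that the SL occupies the top 4 bits of the 16-bit big-endian field and the VID the low 12 bits. All function names are hypothetical, and the exact form of the fix in f781a22f is not quoted in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable be16_to_cpu(): interpret the two stored bytes as big-endian. */
static uint16_t be16_to_host(uint16_t be)
{
	uint8_t b[2];

	memcpy(b, &be, sizeof(b));
	return (uint16_t)((b[0] << 8) | b[1]);
}

/* Build the in-memory big-endian field from SL (4 bits) and VID (12 bits). */
static uint16_t make_sl_vid(unsigned sl, unsigned vid)
{
	uint8_t b[2];
	uint16_t be;

	assert(sl < 16 && vid < 4096);
	b[0] = (uint8_t)((sl << 4) | (vid >> 8));
	b[1] = (uint8_t)(vid & 0xff);
	memcpy(&be, b, sizeof(be));
	return be;
}

/* Convert first, then shift: correct on any host byte order. */
static unsigned sl_convert_then_shift(uint16_t sl_vid)
{
	return (unsigned)(be16_to_host(sl_vid) >> 12);
}

/* Shift the raw value, then convert: only right on big-endian hosts. */
static unsigned sl_shift_then_convert(uint16_t sl_vid)
{
	return be16_to_host((uint16_t)(sl_vid >> 12));
}
```

On a little-endian host the second variant shifts the wrong byte out before swapping, so it does not return the SL; on a big-endian host both variants agree, which is how such a bug can go unnoticed.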


Re: [PATCH] IB/mlx4: fix post_recv wq overflow check

2010-01-19 Thread Or Gerlitz

Roland Dreier wrote:

> I do think it is quite common to see this WQ overflow check trigger, even for
> kernel code

mmm, why is that common? Typically there's a higher layer to which the
IB ULP advertises some sort of maximal number of credits (e.g. in the
SCSI case, iser and srp specify the maximal number of commands in the
scsi host template), or the ULP informs a higher layer that no more sends
can be done (e.g. IPoIB calling netif_stop_queue once it senses that the
QP is full, etc).


Or.


Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space

2010-01-20 Thread Or Gerlitz

sebastien dugue wrote:
>> So I guess you need to change the ports used within the new port space -- but then
>> why can't you just stay in the TCP space but change the ports used?
>
> No, with the new port space, there's no need to change ports. You only need to
> specify the target GUIDs. For example:
> lustre, target-port-guid 0x1234,0x1235 : 1 # lustre traffic to MDSs
> lustre: 2 # default lustre traffic (to OSSs)
> Hope this helps clarify things a bit.

sorry, but it doesn't. As far as I understand, there are three
possibilities for what the string "lustre" is translated to
by the opensm QoS logic:

(A) lustre port in the TCP port space
(B) lustre port space
(C) nothing (that is, not a service, in the same manner that ipoib just
doesn't mean anything to opensm)

Assuming C is not the case, either A or B will yield the same
result, and as such the new port space buys you nothing.


Or.


Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space

2010-01-20 Thread Or Gerlitz
sebastien dugue wrote:
>  No, because in OpenSM's QoS logic, there's no way to map the TCP port
> space with specific target GUIDs onto an SL. You have keywords for SDP, SRP,
> RDS, ISER, ... but not for the TCP port space (or am I missing something?).

going with this, what prevents you from patching the opensm qos engine to
support the lustre service under the tcp port-space and/or support a
combination of service and target port-guid? All in all, first, I don't see
what a kernel patch buys you, and second, if it buys you something you should
be able to gain the same effect by patching opensm.

thinking about this a bit more, since the rules are processed in order,
wouldn't the following scheme let you achieve the same effect?

target-port-guid 0x1234,0x1235 : 1 # traffic to MDSs
lustre: 2 # default lustre traffic (to OSSs)

Or.
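
The two-rule scheme above relies on rules being evaluated in order with the first match winning. The following sketch models only that premise; it is not OpenSM code, and `qos_rule`, `rule_matches` and `qos_lookup` are hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_GUIDS 4

/* Hypothetical model of one simplified rule: "service, guids : sl". */
struct qos_rule {
	const char *service;             /* NULL matches any service */
	uint64_t target_guids[MAX_GUIDS];
	size_t n_guids;                  /* 0 matches any target port */
	int sl;
};

static int rule_matches(const struct qos_rule *r, const char *service,
			uint64_t target_guid)
{
	size_t i;

	if (r->service && strcmp(r->service, service) != 0)
		return 0;
	if (r->n_guids == 0)
		return 1;
	for (i = 0; i < r->n_guids; i++)
		if (r->target_guids[i] == target_guid)
			return 1;
	return 0;
}

/* Returns the SL of the first matching rule, or default_sl if none match. */
static int qos_lookup(const struct qos_rule *rules, size_t n_rules,
		      const char *service, uint64_t target_guid, int default_sl)
{
	size_t i;

	assert(service != NULL);
	for (i = 0; i < n_rules; i++)
		if (rule_matches(&rules[i], service, target_guid))
			return rules[i].sl;
	return default_sl;
}
```

With the rules `{any service, guids 0x1234/0x1235 : 1}` followed by `{lustre : 2}`, lustre traffic to an MDS GUID gets SL 1 and other lustre traffic gets SL 2. In this model, non-lustre traffic to the same GUIDs would also get SL 1, which may or may not be acceptable.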


Re: [PATCH] IB/mlx4: fix post_recv wq overflow check

2010-01-20 Thread Or Gerlitz

Roland Dreier wrote:

> In other words this check catches common bugs and makes them a gazillion times
> easier to find and fix.  So unless the performance impact is extreme, I'm
> inclined to leave it

okay, let's leave it like that unless someone comes up with
performance data that shows this is really a bottleneck.


Or.


Re: [PATCH] ib/ipoib: remove TX moderation from the ethtool related code

2010-01-20 Thread Or Gerlitz

Or Gerlitz wrote:

> As of commit f56bcd8 "IPoIB: Use separate CQ for UD send completions",
> there are no TX interrupts in the main code path. Change the ethtool
> related code to comply with this, so that users will not be misled
> into assuming they can control TX interrupt moderation.

Hi Roland, did you have a chance to look at this one?

Or.


Re: rdma_bind failure over iWarp

2010-01-20 Thread Or Gerlitz
Woodruff, Robert J wrote:
> [wo...@det-17 src]$ ucmatose -b 192.168.0.17
> cmatose: starting server
> cmatose: bind address failed: No such file or directory
> return status -1

A case where rdma_bind returns -ENOENT was debugged here this week; the
problem was the same IP being assigned to two interfaces, one of which did not
belong to an HCA/RNIC. I just tried assigning the same IP to an on-board 1Gb/s
NIC and an IB HCA and couldn't hit the ucmatose error (2.6.33-rc4 and
librdmacm-1.0.8-5.el5).

Moni, anything you can add?

Or.


Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space

2010-01-21 Thread Or Gerlitz
sebastien dugue wrote:
> OK, then going with the TCP port space, what we need in OpenSM is a
> combination of service id (TCP) _and_ TCP port _and_ target GUID.

I believe that you can have a 'lustre' keyword in the opensm qos parser which
stands for the combination of tcp port space + lustre tcp port (maybe it exists
now), so in the policy file this would translate to X,{Z1,Z2,..,Zm} (as in
your example) and not to X,Y,{Z1,Z2,..,Zm}.

Or.


Re: ibv_asyncwatch and buffering

2010-01-21 Thread Or Gerlitz

Håkon Bugge wrote:

> That would make ibv_asyncwatch more useful in scripted environments

patch?

Or.


Re: ib_write_bw hanging when using max max_inline value

2010-01-23 Thread Or Gerlitz

Håkon Bugge wrote:

> the test program hangs when exchanging 920 bytes [...]  the creation of QP goes
> OK

attaching a debugger is typically helpful to see where a program talking
directly to the hardware hangs. If it happens on the slow path, strace
can be useful as well.  Did you take a look at the actual values set for
this QP, that is, as suggested by ibv_create_qp(3), look at the init
attributes after the function returns?


Or.


Re: [PATCH] ib/ipoib: remove TX moderation from the ethtool related code

2010-01-23 Thread Or Gerlitz

Roland Dreier wrote:

> Yes, looks fine, planning to merge it for 2.6.34

okay, good, I see that your for-next branch is updated and
already contains one patch.


Or.


Re: [PATCH] libibverbs: Force line-buffering in ibv_asyncwatch

2010-01-25 Thread Or Gerlitz
Håkon Bugge wrote:
> I used the information at
> www.openfabrics.org/git/?p=ofed_1_2_5/libibverbs.git;a=summary
> which states the "owner" to be Vlad. Maybe that confused me. I'll send a
> copy to Roland

Roland's user space git trees are all hosted @ kernel.org;
the libibverbs one is
git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git
and you can find the libmlx4 and libmthca ones there as well.

Vlad - is there a way to prevent such confusion in the future, maybe by
putting a clear comment in the header of the ofa git page?

Or.


Re: ib_write_bw hanging when using max max_inline value

2010-01-25 Thread Or Gerlitz

Håkon Bugge wrote:

> The capabilities in qp_init_attr used as input to ibv_create_qp() are:
> max_send_sge = 1, max_recv_sge = 1, max_inline_data = 928
> Upon return the capabilities are modified to the following
> max_send_sge = 32, max_recv_sge = 1, max_inline_data = 928
>
> Note decreasing the size of the RDMA to 912 bytes, the program works

Jack, sounds like this use case hits the bug(s) you were attempting to
solve with the patch set we were discussing @
http://marc.info/?l=linux-rdma&m=126330119309593 which never made
it upstream, correct?


Or.


