Re: Merge process for OFED patches
Bart Van Assche wrote:
> I would like to contact the author of the fourth patch. But unfortunately I could not find any author information in that patch.

Yes, unsigned and unreviewed patches are common practice in OFED. Does this create legal issues? Maybe that would be the way to stop this?

Or.

-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ipv6 support in rping
Pradeep Satyanarayana wrote:
> Who will be able to help us with this? Need to include the correct level of librdmacm

Sean, could you do a 1.0.9 release of librdmacm so that the IPv6 support can be distributed?

Or.
Re: How to handle illegal multicast addresses in IPoIB?
Moni Shoua wrote:
> One patch (http://lists.openfabrics.org/pipermail/general/2009-August/061663.html) checks each multicast address for validity before it lets it get into the queue.

Isn't it the commit below, which is already in Linus's tree?

Or.

> commit 5e47596bee12597824a3b5b21e20f80b61e58a35
> Author: Jason Gunthorpe
> Date: Sat Sep 5 20:23:40 2009 -0700
>
>     IPoIB: Check multicast address format
>
>     Check that the format of multicast link addresses is correct before
>     taking them from dev->mc_list to priv->multicast_list. This way we
>     never try to send a bogus address to the SA, which prevents badness
>     from erronous 'ip maddr addr add', broken bonding drivers, etc.
>
>     Signed-off-by: Jason Gunthorpe
>     Signed-off-by: Roland Dreier
Re: Possible process deadlock in RMPP flow
Eli Cohen wrote:
> On Wed, Sep 23, 2009 at 09:08:28AM -0700, Sean Hefty wrote:
>> What kernel does 1.4.2 map to?
> I think OFED 1.4.2 is based on kernel 2.6.27 but they're using RHEL 5.3

Yes, the usual mess: OFED X is based on kernel Y1, but with some additions from kernel Y2, plus plenty of unreviewed and unmerged patches. Distro Z picks OFED X, and the result is 99% unsupportable, as Roland said. Somehow this OFED creature is still hanging around, working on the next damage it's going to bring into this world (code name 1.5).

Eli, here's a little tip for you: I had the displeasure of resolving a bunch of support cases that originated from the fact that the two-year-old commit below was missing from some OFED version (sorry, I forgot the number...). Maybe it would help you as well?

In a normal setting, if this commit actually solves a bug being hit by many customers, someone would have opened a distro Bugzilla case saying "please pick this commit for your kernel", and the customers would have either waited for the next distro update or used an intermediate distro kernel. Currently, as I understand it, distros pick OFED versions and that's it.

Or.

commit b61d92d8ae6aa13b17d1c31e69d123879cec2ee2
Author: Sean Hefty
Date: Fri Nov 30 17:30:18 2007 -0800

    IB/mad: Fix incorrect access to items on local_list
Re: [ewg] [OPENSM] update functions to match .h prototypes
Stan C. Smith wrote:
> Hello,
> The following patches address inconsistencies between header file function prototypes and .c function definitions; missing 'const' attribute.
> Attached is a Linux EOL patch file in case a mailer hacks/reformats the text.
>
> Signed-off-by: Stan Smith (stan.sm...@intel.com)

Stan,

The EWG list isn't meant for development. As such, all patches should go to the developers list, which is

Or.
Re: [ewg] Re: Possible process deadlock in RMPP flow
Eli Cohen wrote:
> Thanks Or. This one is already in OFED 1.4.2 but apparently this is a different problem. Once I have information whether the patch Roland posted fixed it I will update the list.

Eli, did you find a commit that fixes the problem you reported?

Or.
Re: [ewg] [PATCH] mlx4: remove limitation on LSO header size
Eli Cohen wrote:
> Current code has a limitation as for the size of an LSO header not allowed to cross a 64 byte boundary. This patch removes this limitation by setting the WQE RR for large headers thus allowing LSO headers of any size. The extra buffer reserved for MLX4_IB_QP_LSO QPs has been doubled, from 64 to 128 bytes, assuming this is reasonable upper limit to header length.

Hi Eli,

Good to know that you're working on this. I assume you aim to close the missing pieces here, e.g. as you wrote me at http://lists.openfabrics.org/pipermail/general/2008-March/048370.html

> Also, this patch will cause IB_DEVICE_UD_TSO to be set only for FW versions that set MLX4_DEV_CAP_FLAG_BLH; e.g. FW version 2.6.000 and higher.

A warning to users who have older firmware installed?

> +++ b/drivers/infiniband/hw/mlx4/main.c
> @@ -103,7 +103,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
>                 props->device_cap_flags |= IB_DEVICE_UD_AV_PORT_ENFORCE;
>         if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_IPOIB_CSUM)
>                 props->device_cap_flags |= IB_DEVICE_UD_IP_CSUM;
> -       if (dev->dev->caps.max_gso_sz)
> +       if (dev->dev->caps.max_gso_sz && dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BLH)
>                 props->device_cap_flags |= IB_DEVICE_UD_TSO;

So the driver doesn't use the actual value of the max_gso_sz capability; isn't this a bug? The BLH bit (any reason not to mention in the changelog what these three letters stand for...?) lets you support large LSO headers, but it isn't enough: max_gso_sz relates to the payload and should be used, I think.

>         if (dev->dev->caps.bmme_flags & MLX4_BMME_FLAG_RESERVED_LKEY)
>                 props->device_cap_flags |= IB_DEVICE_LOCAL_DMA_LKEY;
> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
> index 219b103..1b356cf 100644
> --- a/drivers/infiniband/hw/mlx4/qp.c
> +++ b/drivers/infiniband/hw/mlx4/qp.c
> @@ -261,7 +261,7 @@ static int send_wqe_overhead(enum ib_qp_type type, u32 flags)
>         case IB_QPT_UD:
>                 return sizeof (struct mlx4_wqe_ctrl_seg) +
>                         sizeof (struct mlx4_wqe_datagram_seg) +
> -                       ((flags & MLX4_IB_QP_LSO) ? 64 : 0);
> +                       ((flags & MLX4_IB_QP_LSO) ? 128 : 0);

64, 128... here and later in build_lso_seg; how about defining something human-readable?

Or.
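A minimal sketch of the "human-readable something" for the UD branch of send_wqe_overhead(), plus the capability test the patch introduces. The constant MLX4_IB_LSO_HEADER_SPARE, the flag TOY_QP_LSO and the two segment sizes are hypothetical stand-ins chosen for illustration; the real driver uses MLX4_IB_QP_LSO and sizeof on struct mlx4_wqe_ctrl_seg / struct mlx4_wqe_datagram_seg.

```c
#include <stdint.h>

/* Hypothetical name for the spare room reserved for LSO headers in a
 * UD send WQE; the patch under review raises it from 64 to 128 bytes. */
#define MLX4_IB_LSO_HEADER_SPARE 128

/* Hypothetical stand-ins for MLX4_IB_QP_LSO and the WQE segment sizes. */
#define TOY_QP_LSO            (1u << 0)
#define TOY_CTRL_SEG_SIZE     16
#define TOY_DATAGRAM_SEG_SIZE 48

/* UD branch of send_wqe_overhead(), with the magic number named once. */
static int toy_ud_send_wqe_overhead(uint32_t flags)
{
	return TOY_CTRL_SEG_SIZE + TOY_DATAGRAM_SEG_SIZE +
	       ((flags & TOY_QP_LSO) ? MLX4_IB_LSO_HEADER_SPARE : 0);
}

/* Capability test from the patch: advertise UD TSO only if the FW
 * reports a nonzero max GSO size AND sets the BLH flag. */
static int toy_ud_tso_supported(uint32_t max_gso_sz, uint64_t caps_flags,
				uint64_t blh_flag)
{
	return max_gso_sz != 0 && (caps_flags & blh_flag) != 0;
}
```

Naming the constant means build_lso_seg() can check the header length against the same symbol instead of repeating 128. Note that the capability test only gates the flag; it does not by itself carry the max_gso_sz value anywhere, which is the point raised above.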
Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping
Sean Hefty wrote:
> Provide an option for user's to manually specify the socket address to DGID mapping on InfiniBand. Currently, all mappings are done using ipoib, and involve ARP. This will not work across IP subnets, and alternative mechanisms of resolving the mapping are being explored. The latter can be more efficient if combined with route resolution as well.

Sean,

If I understand correctly, your suggested changes optionally let an application, instead of the following sequence of calls

    rdma_resolve_addr / addr resolved event
    rdma_create_qp
    rdma_resolve_route / route resolved event
    rdma_connect / cm events

do

    rdma_set_ib_path
    rdma_create_qp
    rdma_connect / cm events

So in that respect, I am not sure how rdma_set_dest serves you. Further, rdma_resolve_addr does three resolutions:

1. the local device and source GID
2. the PKEY (VLAN) to use
3. the destination GID

So in that respect, does rdma_set_ib_path replace both rdma_resolve_addr and rdma_resolve_route?

I would prefer a solution where the app flow isn't touched, something like having the kernel rdma-cm communicate with the user space ACM daemon to get address and route resolutions. Does such a design make sense to you?

Or
Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping
Sean Hefty wrote:
> From user space, the call sequence does not change. The user calls rdma_resolve_addr, rdma_resolve_route, rdma_connect, etc. It is up to the librdmacm to perform the resolution. Today, the resolution request is simply passed down to the kernel, which restricts how the resolution can be performed.

Good, fair enough.

> I kept resolving the address and route separate. rdma_set_ib_path, which has always existed btw, simply sets the route/path. The new call, rdma_set_ib_dest, sets the address mapping. To use rdma_set_ib_dest, the user must have called rdma_bind_addr first, which covers steps 1 & 2 that you mentioned above. The rdma_bind_addr call can be done internally to the librdmacm as part of the rdma_resolve_addr implementation.

I understand that rdma_bind_addr covers the local device and VLAN resolutions, but we should --keep-- supporting applications that pass an explicit source address to rdma_resolve_addr, as well as applications that don't call bind, pass src=NULL to rdma_resolve_addr and rely on the rdma-cm to do a route lookup (as the rdma_resolve_addr man page indicates) for the device/VLAN resolution.

> If a user sets the wrong address mapping or route, they should only affect themselves

I'm not sure I follow this comment; can you elaborate a bit more?

> (FYI - I have not yet implemented the librdmacm to call rdma_bind_addr as part of rdma_resolve_route on linux. I did not see an easy way to convert a destination IP address to a source IP address. If anyone knows how, please let me know.)

I assume you were referring to rdma_resolve_addr, correct? There should be a way to do that from user space, and if not, you can go down to the kernel, resolve the device/VLAN and then call ACM to resolve the destination. It seems that you must resolve the dev/VLAN in order to issue the ACM ARP replacement...

>> I would prefer to have a solution where the app flow isn't touched, something like the kernel rdma-cm to communicate with the user space ACM daemon to get address and route resolutions. Does such a design makes sense to you?
> Long term, this is exactly the type of flow that I envision. I'd like to have real data to show that the ACM implementation scales first, which is part of my problem. I do not have the ability to easily change kernel drivers on any larger sized clusters. My approach is to allow user space to perform the address and route resolution and pass the data to the kernel. This way, we have the freedom to test multiple solutions, until we can settle on what works.

I am not sure I fully follow the easily-change-kernel-drivers claim; isn't some change to the kernel rdma-cm a must for the ACM + librdmacm solution to work? Suppose you have a way to do the full addr+route resolution from user space; will the kernel rdma-cm state machine be willing to accept

    rdma_create_id
    rdma_set_ib_path (you said this exists today?)
    rdma_create_qp
    rdma_connect

???

Or.
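On Sean's question of converting a destination IP address into a source IP address from user space: one common POSIX technique (not necessarily the one Or had in mind) is to connect() a UDP socket to the destination and read back the local address the kernel picked with getsockname(). A UDP connect() sends no packets; it only performs the route lookup and binds a source address. The helper name src_addr_for_dst_v4() is hypothetical, the sketch is IPv4-only, and mapping the result to a netdev/pkey (e.g. via getifaddrs()) is left out.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel routing table which local IPv4
 * address it would use to reach dst; writes it to out as text. */
static int src_addr_for_dst_v4(const char *dst, char *out, socklen_t outlen)
{
	struct sockaddr_in peer, local;
	socklen_t llen = sizeof(local);
	int fd, ret = -1;

	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(9);	/* any nonzero port; nothing is sent */
	if (inet_pton(AF_INET, dst, &peer.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	/* connect() on a UDP socket only selects a route and binds the
	 * source address; getsockname() then reports that address. */
	if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0 &&
	    getsockname(fd, (struct sockaddr *)&local, &llen) == 0 &&
	    inet_ntop(AF_INET, &local.sin_addr, out, outlen) != NULL)
		ret = 0;

	close(fd);
	return ret;
}
```

As Or notes later in the thread, a destination can be reachable from several source addresses; this returns the one the routing table prefers.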
Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping
On Thu, Oct 8, 2009 at 1:42 AM, Sean Hefty wrote:
> My intent, which differs from Jason's, was to fully support the existing librdmacm interfaces as they are defined.

Yes, I agree this is the way to go.

> Implementation wise, if the user of the librdmacm calls rdma_resolve_addr with a src address, it's easy. Without the src address, it's hard, but I may just be missing some easy interface for finding the src address.

Note that dst can map to multiple src addresses, so you're just looking for one of them... It's doable; I will get you the details if you still need them.

>>> If a user sets the wrong address mapping or route, they should only affect themselves
>> I wasn't sure to follow this comment, can you elaborate a bit more?
> I meant that if some bogus app wants to specify an IP to GID mapping that's invalid, the incorrect mapping should only affect connections for that app.

Yes, this makes sense, and I believe the rdma-cm code is written such that one bogus ID doesn't leak its defects to other IDs.

> I can somewhat implement an ACM + librdmacm solution entirely in user space by layering the librdmacm over libibcm. Because of the event reporting, it would be limited in how it could be used, and is unlikely to be something that would ever be supported.

Yes, it would be limited and not really supportable. Going that way for research/experimentation and development is fine, just make sure to never release that...

> Technically, rdma_resolve_addr could remain unchanged, in which case it will do everything it does today, which may include sending an ARP. This is the specific operation that I'd like to avoid.

Again, apps (both user and kernel ones) do use rdma_resolve_addr and we want them to keep doing so (I thought we agreed on that). For staging you may develop the type II address resolution prototype on top of libibcm, but later rdma_resolve_addr would call IBACM and then sync with the kernel.

Basically, can we agree that a type II rdma_resolve_addr(src, dst, timeout) would look like:

    if (src)
        rdma_bind(src)
    else
        call_some_user_space_networking_api_to_convert_dst_to_netdev/src

Next, now that we have dev/pkey:

    - call ACM to resolve the dst IP to a GID, using dev/pkey for that
    - sync the kernel rdma_cm with the resolution if needed for the state machine (hopefully it's not a must at this point and can be done when calling set_path)

Or
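The type II flow outlined above can be sketched as follows, with every step replaced by a toy lookup table. All names here (toy_resolve_addr, toy_bind, toy_route_lookup, toy_acm_resolve and the tables) are hypothetical and do not exist in librdmacm, the kernel rdma_cm or ACM; the route lookup assumes a single IB interface, and the final "sync the kernel rdma_cm" step is deliberately omitted.

```c
#include <stddef.h>
#include <string.h>

struct toy_dev_pkey {
	int ifindex;           /* which IPoIB netdev / HCA port */
	unsigned short pkey;   /* partition ("VLAN") to use */
};

struct toy_local_if { const char *ip; struct toy_dev_pkey dp; };
struct toy_acm_entry { const char *ip; unsigned short pkey; const char *gid; };

/* Toy host configuration: one local IPoIB address. */
static const struct toy_local_if toy_locals[] = {
	{ "192.168.1.10", { 3, 0xffff } },
};

/* Toy ACM view of the fabric: remote IP on a pkey -> GID. */
static const struct toy_acm_entry toy_acm[] = {
	{ "192.168.1.20", 0xffff, "fe80::2:c903:1:28ed" },
};

/* src given: binding to it yields the device and pkey. */
static int toy_bind(const char *src, struct toy_dev_pkey *dp)
{
	for (size_t i = 0; i < sizeof toy_locals / sizeof toy_locals[0]; i++)
		if (strcmp(toy_locals[i].ip, src) == 0) {
			*dp = toy_locals[i].dp;
			return 0;
		}
	return -1;
}

/* src omitted: a route lookup on dst picks the outgoing interface.
 * This toy assumes a single IB interface, so every dst routes there. */
static int toy_route_lookup(const char *dst, struct toy_dev_pkey *dp)
{
	(void)dst;
	*dp = toy_locals[0].dp;
	return 0;
}

/* ACM step: resolve dst IP to a GID, scoped to the pkey found above. */
static const char *toy_acm_resolve(const char *dst, const struct toy_dev_pkey *dp)
{
	for (size_t i = 0; i < sizeof toy_acm / sizeof toy_acm[0]; i++)
		if (strcmp(toy_acm[i].ip, dst) == 0 && toy_acm[i].pkey == dp->pkey)
			return toy_acm[i].gid;
	return NULL;
}

/* Type II resolution: bind or route lookup first, then ACM. */
static int toy_resolve_addr(const char *src, const char *dst,
			    struct toy_dev_pkey *dp, const char **dgid)
{
	int ret = src ? toy_bind(src, dp) : toy_route_lookup(dst, dp);

	if (ret)
		return ret;
	*dgid = toy_acm_resolve(dst, dp);
	return *dgid ? 0 : -1;
}
```

The ordering is the point of the sketch: the dev/pkey must be known before the ACM query, because the ACM reply is scoped to a pkey.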
Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping
Sean Hefty wrote:
>> When used over IB, the IP address is little more than a qualifier contained within the IB CM REQ private data.
>
> If we added support for AF_GID/AF_IB to the kernel, the rdma_cm could leave all of the private data carried in the IB CM REQ entirely up to the user. If the user happens to format that data to look like the CMA header, so be it. I believe this would allow for a 'clean' implementation of rdma_resolve_addr, preserve the ABI, and still allow a library to provide backwards compatibility.

Sean,

So in this design, the librdmacm will change the user-supplied AF_XXX in the provided sockaddr and set it to AF_GID/AF_IB; sounds okay.

> Would this approach combined with the ability to set the route work for everyone?

Yes, it makes sense. However, I can't manage to follow your port space discussion with Jason. Some apps may have the client in user space and the server in the kernel, or vice versa. I wouldn't tie PS_IB or the like to ACM.

The ACM ARP replacement protocol will reply only if the IP address specified in the broadcast request is an IP of this host on that pkey, on a port connected to that fabric, correct?

Or.
Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping
Jason Gunthorpe wrote:
> If the listening side continues to use the IP mode to listen then I guess the client can compute an appropriate service ID, but it seems a bit strange for one side to use IP and the other side to use the ACM method? I was imagining you'd configure both sides to use the same method.

First, unlike IP, in IB only the active/connecting side does address resolution. Second, the listener may be in the kernel while the active side is in user space. But anyway, ACM is an alternative way to do destination GID resolution and path query emulation; I don't see what it has to do with the CM protocol, except for keeping things the way they were in this respect (rdma-cm IP header in the REQ, etc). I don't see why someone who resolves addresses through ACM would stop being a PS_TCP consumer.

Or
Re: [ewg] rping is not resolving ipv6 addresses
David J. Wilder wrote:
> I added an option to rping to specify a source address and supply it to

Patch?

> rdma_resolve_addr(), but now it is failing rdma_resolve_route().
> $ ./rping -d -c -v -a fe80::202:c903:1:1925 -i fe80::202:c903:1:28ed
> cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x100213d0 (parent)
> cma_event type RDMA_CM_EVENT_ROUTE_ERROR cma_id 0x100213d0 (parent)
> cma event RDMA_CM_EVENT_ROUTE_ERROR, error -22

What does the neighbour info (ip neigh show | grep 1925) show after running rping? Can you do an IPoIB ping and ping6 to the fe80::202:c903:1:1925 host?

Or.
Re: [ewg] rping is not resolving ipv6 addresses
David J. Wilder wrote:
> If I run rping without my rping change to add the source address to rdma_resolve_address(), ip neigh show gives: fe80::202:c903:1:1925 dev eth1 FAILED
> Notice that interface is incorrect, it should be ib0. tcpdump showed the neighbor-discovery sent out the eth0 interface.

Yes, this is what Roland explained.

> Running with my rping change to specify the local-link address of my ib0 interface "ip neigh show" never shows any entry for fe80::202:c903:1:1925

Hmm... weird. Run your rping with tcpdump in another screen and see if ND takes place.

> ping6 will work but I must specify the interface to use: ping6 fe80::202:c903:1:1925%ib0

After the tcpdump experiment, run ping6, and immediately after that (or in parallel in another window) run rping.
Re: rping is not resolving ipv6 addresses
Sean Hefty wrote:
> The rdma cm was never fully coded or tested for ipv6 support.

Sean, even if not fully coded/tested, some work has been done, e.g. commits 38617c64 "RDMA/addr: Add support for translating IPv6 addresses" and 1f5175ad "RDMA/cma: Add IPv6 support". I suggest we try to see what it takes to make this better, or even to fully support IPv6.

Jason, can you restate the two problems you saw in David's reports? The 1st was related to the scope of link-local addresses; what's the 2nd?

Or.
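The link-local scope problem shows up in David's report: ping6 needed the %ib0 suffix, and without it neighbour discovery went out the wrong interface. A minimal sketch of how a user-space tool could turn an "addr%ifname" string into a struct sockaddr_in6 with sin6_scope_id filled in before handing it to rdma_resolve_addr(). It assumes a getaddrinfo() that accepts the RFC 4007 zone suffix on numeric IPv6 literals (glibc does); parse_scoped_v6() is a hypothetical helper, not code taken from rping.

```c
#include <net/if.h>      /* if_nametoindex(), handy for checking the result */
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: parse "fe80::...%ifname" into a sockaddr_in6
 * whose sin6_scope_id identifies the interface, as ping6 does. */
static int parse_scoped_v6(const char *text, struct sockaddr_in6 *out)
{
	struct addrinfo hints, *res;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET6;
	hints.ai_flags = AI_NUMERICHOST;	/* literal only, no DNS */
	if (getaddrinfo(text, NULL, &hints, &res) != 0)
		return -1;

	/* For AF_INET6 results, ai_addr points at a sockaddr_in6. */
	memcpy(out, res->ai_addr, sizeof(*out));
	freeaddrinfo(res);
	return 0;
}
```

Without the suffix the scope id comes back as 0, which leaves the choice of interface to the routing code, matching the eth1 neighbour entry David saw.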
Re: OpenSM Failover
Yevgeny Kliteynik wrote:
> There was a hand-over problem in OFED 1.4, but later it turned out to be FW issue. The thing is, FW version 2.6.648 doesn't have this bug any more...

So things should work fine with the newly released 2.7 firmware? If this is still in question, Aaron, I suggest you open a Bugzilla case at https://bugs.openfabrics.org and we can track it from there.

Or.
Re: [PATCH 1/2] rdma/cm: support option to allow manually setting IB path
Sean Hefty wrote:
> Before spending any more time on this patch series, is there any disagreement to accepting this patch (as is or slightly modified) upstream?

Hi Sean,

This patch just sets a route into the kernel and has the kernel issue a route resolved event in return. Sounds good to me; I don't see any problem with merging it upstream.

However, we still have a discussion to continue on the slightly bigger picture, which relates to how address resolution is "set" into the kernel, which port spaces would be supported, etc. This discussion is getting closer to the ACM design... let's continue it on the "rdma/cm: allow user to specify IP to DGID mapping" thread.

Or.
Re: switching the active interface for bonding
Sumeet Lahorani wrote:
> We are [...] trying to simulate the effect of a bonding failover initiated by a switch failure using echo commands in parallel to the /sys/class/net/bond0/bonding/active_slave file on a few of the nodes attached to the switch. Is this an acceptable technique?

Yes.

> We are trying to avoid actually resetting the switch to avoid affecting other nodes connected to the same switch, since the other nodes are being used for other purposes

There's no need to reboot a switch in order to cause an IB link down event on an HCA port connected across the wire to one of the switch ports. You can administratively disable the switch port you want and later administratively enable it. This is as simple as

    $ ibportstate $LID $PORT disable/query/enable

using the switch LID and the switch port that the HCA port is connected to.

> Would there be any difference in terms of the code path which the bonding driver/ofed stack follows when we do this as opposed to resetting the switch?

Yes and yes. Bonding-wise, when setting the active slave through sysfs, the bonding driver doesn't go through the link monitoring code, whereas if you cause a link down it does. As for the IB stack (there's no such thing as an "OFED stack"; OFED is just a bunch of RPMs installed over your distro), when a port goes down, things happen... If the software you're using counts/uses IB port down events, you may exercise a different flow. E.g. IPoIB uses these events, and with the sysfs approach you will not go through its port down flow. Next, if some code you're working with uses the IB RC transport, then depending on the timeout programmed into the RC QP, a transport timeout may happen, which in turn causes the HW to move the QP into the error state, and so on.

Or
Re: switching the active interface for bonding
Sumeet Lahorani wrote:
> We are using OFED 1.4.2

Please note that the bonding driver provided by the latest distros supports IPoIB. So if your distro happens to be RHEL 5.4 (or its OEL 5.4 derivative) or SLES11, you can and should use the distro-provided bonding. Moving forward, on the one hand customers will use only distro code, and on the other hand bonding will be moved out of OFED, so you had better start working that way now, as things will be outside anyway.

Or.
Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping
Sean Hefty wrote:
> From the perspective of IB, the RDMA CM simply defines a specific format to private data and service ID carried in the IB CM REQ. As long as any use adheres to that protocol, interoperability won't be an issue.

Okay, I just wanted to make sure that the whole thing (ACM + modified librdmacm + modified rdma-cm) is applicable AND interoperable for AF_INET / PS_TCP applications. Looking at the kernel cma.c format_hdr code, it first branches on the address family and then on the port space.

Going with your proposed flow, I understand that an app call to rdma_resolve_addr will be broken down into rdma_bind_addr, ACM resolution of the destination GID and then rdma_set_ib_dest, so things should work perfectly for AF_INET / PS_TCP apps, correct? The only missing piece here is the route lookup from user space for applications that don't specify a source address in their rdma_resolve_addr invocation; do you still need help implementing that?

> Essentially, the RDMA CM interface would become capable of connecting to any IB application. (I really haven't thought through the details yet, and the addition of RDMA_PS_IB shouldn't be part of the initial patch submission.)

Fair enough. I just wanted to make sure with you that AF_IB / PS_IB aren't tightly coupled with the proposed change, and you have clarified that.

> The ACM responds based on a configuration file. The ib_acme utility can create that file using the active IP, pkey, port information of the system, but the current ACM implementation does not adjust to dynamic changes or detect misconfigurations or other made up words.

I see. Is the new librdmacm flow going to be under a new API, e.g. rdma_resolve_addr/route_ext, or the same API optionally talking to ACM through some IPC if the ACM daemon is running, or something else?

Or.
Re: [PATCH 1/2] rdma/cm: support option to allow manually setting IB path
Or Gerlitz wrote:
> sounds good to me, I don't see any problem with merging it upstream.

Hi Sean,

Are you moving forward with these patches for 2.6.33?

Or
[PATCH] ib/iser: re-write SG handling for rdma logic
After dma-mapping an SG list provided by the SCSI midlayer, iser has to make sure the mapped SG is "aligned for RDMA", in the sense that it's possible to produce one mapping in the HCA IOMMU which represents the whole SG. Next, the mapped SG is formatted for registration with the HCA.

This patch rewrites the logic that does the above, to make it clearer and simpler. It also fixes a bug in the aligned-for-RDMA checks, where only an "end" check was done and the "start" check was missing.

Signed-off-by: Alexander Nezhinsky
Signed-off-by: Or Gerlitz

Index: linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
===================================================================
--- linux-2.6.32-rc5.orig/drivers/infiniband/ulp/iser/iser_memory.c
+++ linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
@@ -209,6 +209,8 @@ void iser_finalize_rdma_unaligned_sg(str
        mem_copy->copy_buf = NULL;
 }
 
+#define IS_4K_ALIGNED(addr)    ((((unsigned long)addr) & ~MASK_4K) == 0)
+
 /**
  * iser_sg_to_page_vec - Translates scatterlist entries to physical addresses
  * and returns the length of resulting physical address array (may be less than
@@ -221,62 +223,52 @@ void iser_finalize_rdma_unaligned_sg(str
  * where --few fragments of the same page-- are present in the SG as
  * consecutive elements. Also, it handles one entry SG.
  */
+
 static int iser_sg_to_page_vec(struct iser_data_buf *data,
                               struct iser_page_vec *page_vec,
                               struct ib_device *ibdev)
 {
-       struct scatterlist *sgl = (struct scatterlist *)data->buf;
-       struct scatterlist *sg;
-       u64 first_addr, last_addr, page;
-       int end_aligned;
-       unsigned int cur_page = 0;
+       struct scatterlist *sg, *sgl = (struct scatterlist *)data->buf;
+       u64 start_addr, end_addr, page, chunk_start = 0;
        unsigned long total_sz = 0;
-       int i;
+       unsigned int dma_len;
+       int i, new_chunk, cur_page, last_ent = data->dma_nents - 1;
 
        /* compute the offset of first element */
        page_vec->offset = (u64) sgl[0].offset & ~MASK_4K;
 
+       new_chunk = 1;
+       cur_page  = 0;
        for_each_sg(sgl, sg, data->dma_nents, i) {
-               unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
-
+               start_addr = ib_sg_dma_address(ibdev, sg);
+               if (new_chunk)
+                       chunk_start = start_addr;
+               dma_len = ib_sg_dma_len(ibdev, sg);
+               end_addr = start_addr + dma_len;
                total_sz += dma_len;
 
-               first_addr = ib_sg_dma_address(ibdev, sg);
-               last_addr  = first_addr + dma_len;
-
-               end_aligned   = !(last_addr  & ~MASK_4K);
-
-               /* continue to collect page fragments till aligned or SG ends */
-               while (!end_aligned && (i + 1 < data->dma_nents)) {
-                       sg = sg_next(sg);
-                       i++;
-                       dma_len = ib_sg_dma_len(ibdev, sg);
-                       total_sz += dma_len;
-                       last_addr = ib_sg_dma_address(ibdev, sg) + dma_len;
-                       end_aligned = !(last_addr  & ~MASK_4K);
+               /* collect page fragments until aligned or end of SG list */
+               if (!IS_4K_ALIGNED(end_addr) && i < last_ent) {
+                       new_chunk = 0;
+                       continue;
                }
+               new_chunk = 1;
 
-               /* handle the 1st page in the 1st DMA element */
-               if (cur_page == 0) {
-                       page = first_addr & MASK_4K;
-                       page_vec->pages[cur_page] = page;
-                       cur_page++;
+               /* address of the first page in the contiguous chunk;
+                  masking relevant for the very first SG entry,
+                  which might be unaligned */
+               page = chunk_start & MASK_4K;
+               do {
+                       page_vec->pages[cur_page++] = page;
                        page += SIZE_4K;
-               } else
-                       page = first_addr;
-
-               for (; page < last_addr; page += SIZE_4K) {
-                       page_vec->pages[cur_page] = page;
-                       cur_page++;
-               }
-
+               } while (page < end_addr);
        }
+
        page_vec->data_size = total_sz;
        iser_dbg("page_vec->data_size:%d cur_page %d\n", page_vec->data_size,cur_page);
        return cur_page;
 }
 
-#define IS_4K_ALIGNED(addr)    ((((unsigned long)addr) & ~MASK_4K) == 0)
 
 /**
  * iser_data_buf_aligned_len - Tries to determine the maximal correctly aligned
@@ -284,42 +276,40 @@ static int iser_sg_to_page_vec(struct is
  * the number of entries which are aligned correctly. Supports the case where
  * consecutive
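To see what the rewritten loop computes, here is a user-space sketch that runs the same chunking logic over a plain array of (address, length) pairs instead of a DMA-mapped scatterlist. struct toy_sge and toy_sg_to_pages() are hypothetical; SIZE_4K and MASK_4K are defined the way the names suggest (MASK_4K keeps the page-frame bits), and, unlike the patch, the first-page offset is taken from the first address rather than from sgl[0].offset.

```c
#include <stdint.h>

#define SIZE_4K 4096ULL
#define MASK_4K (~(SIZE_4K - 1))		/* keeps the page-frame bits */
#define IS_4K_ALIGNED(addr) ((((uint64_t)(addr)) & ~MASK_4K) == 0)

struct toy_sge {
	uint64_t addr;		/* stand-in for ib_sg_dma_address() */
	unsigned int len;	/* stand-in for ib_sg_dma_len() */
};

static int toy_sg_to_pages(const struct toy_sge *sg, int nents,
			   uint64_t *pages, uint64_t *offset,
			   unsigned long *total)
{
	uint64_t chunk_start = 0, end_addr, page;
	int i, cur_page = 0, new_chunk = 1, last_ent = nents - 1;

	*total = 0;
	*offset = sg[0].addr & ~MASK_4K;	/* offset into the first page */

	for (i = 0; i < nents; i++) {
		if (new_chunk)
			chunk_start = sg[i].addr;
		end_addr = sg[i].addr + sg[i].len;
		*total += sg[i].len;

		/* keep collecting fragments until a chunk ends on a 4K
		 * boundary or the list runs out */
		if (!IS_4K_ALIGNED(end_addr) && i < last_ent) {
			new_chunk = 0;
			continue;
		}
		new_chunk = 1;

		/* emit every 4K page the finished chunk touches */
		page = chunk_start & MASK_4K;
		do {
			pages[cur_page++] = page;
			page += SIZE_4K;
		} while (page < end_addr);
	}
	return cur_page;
}
```

For example, an entry starting 0x200 into a page and ending on a page boundary, followed by a 6 KiB entry that starts on a boundary, yields three pages; two half-page fragments of the same page collapse into a single page entry.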
Re: [PATCH] librdmacm: initialize correct pthread condition in rdma_join_multicast
Sean Hefty wrote:
> rdma_join_multicast re-initializes id_priv->cond rather than mc->cond. Fix this.
> Bug reported by Nir Naaman

Any idea what the impact of this bug is?

Or.
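A minimal sketch of the shape of the fix, using simplified stand-ins for the librdmacm private structures (struct toy_id_private, struct toy_mc and both helpers are hypothetical; only the fields needed to show the fix are kept). From the POSIX side alone, the buggy version both left mc->cond uninitialized and called pthread_cond_init() on a condition variable that was already initialized, which POSIX leaves undefined; what that means in practice depends on the threading implementation.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the librdmacm private structs. */
struct toy_id_private {
	pthread_mutex_t mut;
	pthread_cond_t cond;	/* initialized when the id was created */
};

struct toy_mc {
	struct toy_id_private *id_priv;
	pthread_cond_t cond;	/* per-join condition, initialized here */
};

static struct toy_mc *toy_mc_create(struct toy_id_private *id_priv)
{
	struct toy_mc *mc = calloc(1, sizeof(*mc));

	if (!mc)
		return NULL;
	mc->id_priv = id_priv;

	/* The fix: initialize the new object's own condition variable.
	 * The buggy version called pthread_cond_init(&id_priv->cond, NULL)
	 * here instead, re-initializing a cond that may already be in use. */
	if (pthread_cond_init(&mc->cond, NULL)) {
		free(mc);
		return NULL;
	}
	return mc;
}

static void toy_mc_destroy(struct toy_mc *mc)
{
	pthread_cond_destroy(&mc->cond);
	free(mc);
}
```

Keeping the init next to the allocation of the object that owns the condition variable makes this kind of slip easier to spot in review.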
Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping
Sean Hefty wrote: okay, I just wanted to make sure that the whole thing (ACM + modified librdmacm + modifed rdma-cm) is applicable AND inter-operable for AF_INET / PS_TCP applications I do not intend to have any changes that break anything my question went beyond whether things are going to be broken (they aren't as you said), but rather will ACM is going to be ***applicable*** for AF_INET/PS_TCP application. From your reply and the discussion that followed between you and Jason, I got the impression that the answer is "not really" b/c if for example the server side thinks it would be getting the IP address of the connecting side in the REQ private header, once this REQ was sent in the flow of AF_INET which was converted to AF_IB, this is not going to happen. Moreover, if the SID constructed by AF_INET / PS_TCP call to rdma_resolve_address which uses the librdmacm-ACM flow wouldn't match the SID constructed in the passive side which didn't use this flow (e.g user --> kernel or kernel --> user app), the REQ wouldn't be getting anywhere and be rejected by the CM on the passive side :( Going with your proposed flow, I understand that an app call to rdma_resolve_addr will be broken down to rdma_bind_addr, ACM resolution of the destination GID and then rdma_set_ib_dest, so things should work perfect for AF_INET / PS_TCP apps, correct? This is my current plan for the kernel: export rdma_set_ib_paths to user space. Submit a patch. Get it accepted upstream. Eat ice cream to celebrate. again, rdma_set_ib_path for itself is quite innocent... as I wrote you couple of days ago, it can be merged anytime, the big thing is the bind / address resolution modified flow which effects the connect/listen, etc. So just for this patch, I would go on a small size ice-cream, where once the design for the bigger picture is in place, go for a pint... Define AF_IB and struct sockaddr_ib (contains a gid and service id). 
Update rdma_bind_addr, rdma_resolve_addr, and rdma_connect to handle AF_IB. rdma_bind_addr fills in the SID according to the RDMA IP CM Service Annex. rdma_resolve_addr just needs to save the GIDs. rdma_connect will not modify the private data in the CM REQ for AF_IB. I really tried to follow the thread between you and Jason, with little success, and I am going to give it more tries... In parallel, could you help me understand what the --drive/reasoning--, from your perspective, is for adding AF_IB / PS_IB here? I believe the suggestion I brought up would allow serving AF_INET / PS_TCP with ACM: convert rdma_resolve_addr with a null src addr into a route lookup followed by rdma_bind_addr (with a similar/same flow for rdma_resolve_addr with a src address), next do the ACM DGID resolution and call rdma_set_dgid. If for other reasons people want the rdma-cm to support AF_IB and/or PS_IB, we can do that as well, but why force doing that behind the covers each time ACM is used?! Or.
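Sean's step "rdma_bind_addr fills in the SID according to the RDMA IP CM Service Annex" and Or's worry about SID mismatches both come down to one encoding. Below is a minimal sketch, assuming the port-space values from librdmacm's rdma_cma.h as I recall them (RDMA_PS_TCP = 0x0106, RDMA_PS_UDP = 0x0111) and that the SID is simply the port space placed above the 16-bit port. The `sockaddr_ib_sketch` struct and its field names are hypothetical, modeled only on Sean's description of "a gid and service id"; they are not a layout any kernel defines.

```c
#include <stdint.h>

/* Port-space values as I recall them from librdmacm's rdma_cma.h. */
enum { PS_IPOIB = 0x0002, PS_TCP = 0x0106, PS_UDP = 0x0111 };

/* Hypothetical AF_IB address: "a gid and service id" and nothing more.
 * Field names are invented for illustration. */
struct sockaddr_ib_sketch {
	uint16_t family;   /* would carry AF_IB */
	uint8_t  gid[16];  /* GID the id is bound or connecting to */
	uint64_t sid;      /* service ID, host byte order in this sketch */
};

/* IP CM Service ID (assumed encoding): port space in bits 31:16 and the
 * TCP/UDP port in bits 15:0. For PS_TCP the top byte is 0x01 and the next
 * byte is the IP protocol number (6). */
static uint64_t ip_cm_service_id(uint16_t ps, uint16_t port)
{
	return ((uint64_t)ps << 16) | port;
}
```

If an active side running the ACM flow derived the SID in any other way, the passive side's CM would not recognize the service and would reject the REQ, which is the failure mode Or describes.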
Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping
Sean Hefty wrote: These are the things done today in the kernel wrt IB:
* Map a local or remote IP address to a GID
* If a local address is not given, provide a usable address based on the destination address
* Acquire a path between the source and destination
* Format the first 36 bytes of private data in the CM REQ
Any or all of these could be done in user space instead. Adding AF_IB to the kernel can provide a clean way of enabling this. It can also allow full support of IB CM functionality through the RDMA CM interfaces.
Sean, first, on top of what you have mentioned above, the kernel also generates the SID to connect to / listen on, maintains a "binding" (mapping) between an rdma-cm id and a netdevice, which today is used for generating address change events, and maybe handles some more tasks that neither of us brought up. From what you write here I understand that the reasoning is something like:
1. we can do all this in user space
2. to that end, the AF_INET/PS_TCP flow has to be converted to AF_IB/PS_IB behind the covers
Well, you didn't address some of my comments (not the ice-cream ones...), which say that this wouldn't be interoperable if on one side you convert INET/TCP to IB/IB and on the other side you don't (e.g. userA/userB, user/kernel, kernel/user etc. schemes). Also, the functionality added under the bonding scheme is lost, etc. I am asking you to have INET/TCP apps enjoy both ACM's DGID and route resolution without being converted to IB/IB, simple as that. If needed, I'd be happy to assist in making this flow happen. The rdma-cm was born first and foremost to serve as the glue between the IP and RDMA worlds, and I just ask you, as the maintainer, to keep this glue working well under ACM too. Or.
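The last item on Sean's list, and the place where a passive side would look for the connecting side's IP, is a fixed 36-byte prefix of the CM REQ private data. The sketch below lays it out as I read the RDMA IP CM Service annex and the kernel's internal header: a version byte, an IP-version nibble, the source port, then 16-byte source and destination addresses with IPv4 stored in the last four bytes. The struct and function names are illustrative, not kernel or librdmacm API.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the 36-byte IP CM header at the start of the REQ private data.
 * Byte arrays keep the layout free of compiler padding. */
struct ip_cm_req_hdr {
	uint8_t cm_version;    /* 0x00 */
	uint8_t ip_version;    /* IP version in bits 7:4 */
	uint8_t src_port[2];   /* network byte order */
	uint8_t src_addr[16];  /* IPv4 goes in the last 4 bytes */
	uint8_t dst_addr[16];
};

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Build the header an AF_INET connect would send; addresses in host order. */
static void ip_cm_req_hdr_v4(struct ip_cm_req_hdr *h, uint32_t src,
			     uint32_t dst, uint16_t src_port)
{
	memset(h, 0, sizeof *h);
	h->cm_version = 0x00;
	h->ip_version = 4 << 4;
	h->src_port[0] = (uint8_t)(src_port >> 8);
	h->src_port[1] = (uint8_t)(src_port & 0xff);
	put_be32(&h->src_addr[12], src);
	put_be32(&h->dst_addr[12], dst);
}
```

Since, per Sean, rdma_connect will not touch the private data for AF_IB, a passive side that parses these 36 bytes as an IP header would not find the peer's address there when the active side took the AF_IB path.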
Re: [PATCH 2/2] rdma/cm: allow user to specify IP to DGID mapping
Jason Gunthorpe wrote: So why not have a more general, flexible approach? Isolating ACM from librdmacm by using AF_IB is a good idea; it keeps them separate and lets ACM and future schemes go wherever they need to. I hope Sean can make it work with the rdma_getaddrinfo idea; that would completely separate ACM and librdmacm. Generally speaking, AF_IB/PS_IB sounds okay to me, even though it is not clear to me what applications are going to use it; maybe some examples, please? Attempting to bake it into AF_INET means that librdmacm, possibly the kernel, and maybe even the apps need to be contaminated with ACM-specific code, and that just doesn't seem desirable to me. What happens when someone invents BCM or CCM? More mess. I don't agree; the only place where librdmacm goes to ACM is to resolve the DGID and a route. This can be done with rdma_getdgidinfo & rdma_getrouteinfo if you like (maybe you do the implementation?), or with an ACM (later BCM, CCM) plugin used by librdmacm, or by calls from librdmacm to ACM. But in any case, neither the kernel code nor the app will be contaminated. Or.
Re: [PATCH] link-local address fix for rdma_resolve_addr
Jason Gunthorpe wrote: Mainly, for RDMA what you get is more kernel flexibility, IP-like capabilities, better bonding, and better support of IB APM semantics:
- bind() + listen() actually works properly if more than one interface is bound to the same IP
- the cm_id returned by accept is bound to the HCA and port that accepted the connection [ This is an L3 form of bonding that Linux supports ] This is actually something of a mandatory notion to implement the full generality of the IB CM protocol, which allows the CM REP to contain a port GUID of another port on the same node (multi-port APM is an IB feature). So you never know what port the accept() result will get bound to. BTW: I suppose ideally AF_IB would have a way to say 'CM accept REPs on any port on this node'. Hmm, a reserved GID prefix perhaps? Hmm. When used with bonding, this would also afford the kernel the ability to accept incoming connections across all the redundant RDMA devices, and still have correct bound-to-IP semantics.
- rdma_resolve_addr more or less as the inverse of all the above
  * multiple interfaces with same IP case works; kernel and routing table can distribute outgoing connections
  * multi-port APM works; kernel and user space can choose primary and backup port for the IP address
  * bonding works; kernel can balance outgoing connections across the bond slaves.
These are all useful features.
Jason, have you even looked into or tested any of the bonding load-balancing modes with IPoIB? Some/most of them are not applicable to IPoIB, and I don't think the ones that may be applicable were ever tested. Next, multiple interfaces with the same IP address isn't something I see as very useful for production environments (but I'd be happy to get educated on what L3 bonding is and where it can play). Next, multi-port APM isn't something I have ever heard customers require, and more important, from comments made by Sean in the past, I don't think it fits the rdma-cm spirit.
All in all, someone comes here and suggests some fixes to the rdma-cm address resolution code to make IPv6 work. I don't think Dave should carry all your proposed future enhancements on his back/patch. Let him fix things, and following that you can work on patches to support all these nice/niche features, starting with IPv4 and then IPv6. Or.
Re: [PATCH] link-local address fix for rdma_resolve_addr
Jason Gunthorpe wrote: I was saying that the point in the rdmacm where the rdma_cm_id is bound to a local RDMA device should have been only rdma_resolve_addr and rdma_accept. Overloading rdma_bind_addr to both bind to an IP and bind to an RDMA device was a bad API choice. As you wrote, for the most part binding comes into play only for users calling rdma_resolve_addr or rdma_accept. For users that need explicit binding, the rdma-cm provides rdma_bind_addr, and it binds to both the IP and the device. If you can do better, send a patch; binding can't be removed from the API since it has users, and it makes sense for users to require it. Sean is right, there may be special cases that require an early binding, but a separate API - like IP's SO_BINDTODEVICE - would have been better - and users are forewarned that calling it restricts the environments their app will support. It's just naming; will $ sed s/rdma_bind/rdma_set_opt(RDMA_BINDTODEVICE)/g make you happier? Why? As it stands we have several impossible situations. Sean, Dave, and I were discussing the trade-offs of what this means relative to IP route resolution. Don't tell me that Dave's patches are blocked b/c you discovered the rdma_bind design and now you don't like it. As I wrote you, Dave sent a patch to fix the IPv6 support; during the discussion on his patches you come and bring up more and more issues you consider problems (but I don't) and block the patch set. I don't think this is appropriate. Let the patches go and send your own patches to fix the problems you see. Why does anyone touching a piece of code have to fix every problem you see in that piece?! - but it affects bonding too. If you rdma_bind_addr to the IP of a bonding device, the stack must pick one of the local RDMA ports immediately. If you then call rdma_listen there is a problem: incoming connections may target either RDMA device, but you are only bound to one of them.
An app cannot say 'I want to listen on this IP, any RDMA device' with the current API, as you can in IP, and that is a shame. An app can say: I want to listen on that IP and on the RDMA device that is currently associated with this IP. When bonding does a fail-over it generates a netevent; the rdma-cm catches this event and generates an address change event, and apps can redo their bind/listen at this point. So far we have never received a user report of a problem; people are probably listening on all IPs, which works perfectly with bonding. Currently the HA mode of bonding responds to ARP on only one of the devices, and as such connection requests will not target just any RDMA device but only the active one. If this is such a shame, send a fix; spraying mud on the maintainer and/or on someone sending another patch is a shame, isn't it? Traditionally with Ethernet, L2 bonding is really only used for link aggregation, L1 failure, and a simple multi-switch HA scheme. It is not deployed if you have multiple Ethernet domains. Some people prefer to have dual, independent Ethernet fabrics, and in that case you rely on routing features to get the multipath and HA features of bonding. okay, thanks for the crash course. Go back to the list and look up the posts from Leo, who first discovered this; what he was trying to do is kinda the L3 bonding approach. If Leo has a problem and you want to help him, bring it to the list, debate, send patches; jumping into someone else's patch isn't the constructive way to go. David has been doing a good job and I am glad he is working on the IPv6 support. My comments are only intended to clarify how this is all supposed to work and why the IP flow is actually still relevant to RDMA connections. As I see it, your comments block the patches sent by Dave, Sean? Or.
Re: [PATCH] link-local address fix for rdma_resolve_addr
Jason Gunthorpe wrote: Wow, seriously? You do understand the purpose of review, right? I think I do, maybe not to the depth you and your arguments go, but again, repeating myself: my simple argument is that your review goes way beyond the --change-- suggested by a patch into the whole logic, and you block a patch b/c you don't like the logic this patch integrates with. To some extent such practice is accepted, but you took it way beyond the acceptable limit. I don't accept your assertion that the whole logic is broken, and it makes sense to me to have a patch from Dave fix the IPv6 part of it. Next, or in parallel, you are welcome to send a patch fixing/re-writing the whole bind logic, or even the whole rdma stack, or the whole kernel. And yes, actually, accounting for how rdma_bind() is different from bind() when doing route resolution is pretty much the main remaining problem. Go and fix that. Or.
Re: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM
Jason Gunthorpe wrote:
> **COMPILE TESTED ONLY**
Any reason why other people have to test for you?
> Convert the address resolution process for outgoing connections
> to be very similar to the way the TCP stack does the same operations.
> This fixes many corner case bugs that can crop up.
rdma_join_multicast(3) states that "before joining a multicast group, the rdma_cm_id must be bound to an RDMA device by calling rdma_bind_addr or rdma_resolve_addr"; please make sure that this flow isn't broken by your patch. Or.
Re: [PATCH] opensm: Add initial support for optimized SLtoVLMappingTable programming
Hal Rosenstock wrote: On Thu, Oct 29, 2009 at 10:23 PM, Sasha Khapyorsky wrote: Implementation description would be very useful. What does "initial support" mean? It means there's more to come in terms of using OptimizedSLtoVLMappingProgramming. This is the simplest use/introduction of this optional feature. You can't just send people to read specs; your change log should explain what the patch is about, and if this is a big change to opensm, maybe even RFC it with a detailed writeup. Or.
Re: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM
Jason Gunthorpe wrote: On Wed, Oct 28, 2009 at 10:05:19AM -0700, Sean Hefty wrote: A UD endpoint can communicate using multicast and with other UD endpoints. A user could resolve a UD endpoint before joining a multicast group. So the IP world analog would be:
fd = socket(AF_INET,SOCK_DGRAM);
connect(fd,'Some Unicast Address');
setsockopt(fd,IP_ADD_MEMBERSHIP,'Some Multicast Address');
sendto(fd,...,'Some Multicast Address');
IP multicast senders don't call IP_ADD_MEMBERSHIP; only receivers do. Or.
[PATCH RESEND] ib/iser: re-write SG handling for rdma logic
After dma-mapping an SG list provided by the SCSI midlayer, iser has to make sure the mapped SG is "aligned for RDMA", in the sense that it's possible to produce one mapping in the HCA IOMMU which represents the whole SG. Next, the mapped SG is formatted for registration with the HCA.

This patch re-writes the logic that does the above, to make it clearer and simpler. It also fixes a bug in the aligned-for-RDMA check, where only the "end" of each entry was checked and not its "start".

Signed-off-by: Alexander Nezhinsky
Signed-off-by: Or Gerlitz

Index: linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
===================================================================
--- linux-2.6.32-rc5.orig/drivers/infiniband/ulp/iser/iser_memory.c
+++ linux-2.6.32-rc5/drivers/infiniband/ulp/iser/iser_memory.c
@@ -209,6 +209,8 @@ void iser_finalize_rdma_unaligned_sg(str
 		mem_copy->copy_buf = NULL;
 	}
+#define IS_4K_ALIGNED(addr)	((((unsigned long)addr) & ~MASK_4K) == 0)
+
 /**
  * iser_sg_to_page_vec - Translates scatterlist entries to physical addresses
  * and returns the length of resulting physical address array (may be less than
@@ -221,62 +223,52 @@ void iser_finalize_rdma_unaligned_sg(str
  * where --few fragments of the same page-- are present in the SG as
  * consecutive elements. Also, it handles one entry SG.
  */
+
 static int iser_sg_to_page_vec(struct iser_data_buf *data,
 			       struct iser_page_vec *page_vec,
 			       struct ib_device *ibdev)
 {
-	struct scatterlist *sgl = (struct scatterlist *)data->buf;
-	struct scatterlist *sg;
-	u64 first_addr, last_addr, page;
-	int end_aligned;
-	unsigned int cur_page = 0;
+	struct scatterlist *sg, *sgl = (struct scatterlist *)data->buf;
+	u64 start_addr, end_addr, page, chunk_start = 0;
 	unsigned long total_sz = 0;
-	int i;
+	unsigned int dma_len;
+	int i, new_chunk, cur_page, last_ent = data->dma_nents - 1;
 	/* compute the offset of first element */
 	page_vec->offset = (u64) sgl[0].offset & ~MASK_4K;
+	new_chunk = 1;
+	cur_page = 0;
 	for_each_sg(sgl, sg, data->dma_nents, i) {
-		unsigned int dma_len = ib_sg_dma_len(ibdev, sg);
-
+		start_addr = ib_sg_dma_address(ibdev, sg);
+		if (new_chunk)
+			chunk_start = start_addr;
+		dma_len = ib_sg_dma_len(ibdev, sg);
+		end_addr = start_addr + dma_len;
 		total_sz += dma_len;
-		first_addr = ib_sg_dma_address(ibdev, sg);
-		last_addr = first_addr + dma_len;
-
-		end_aligned = !(last_addr & ~MASK_4K);
-
-		/* continue to collect page fragments till aligned or SG ends */
-		while (!end_aligned && (i + 1 < data->dma_nents)) {
-			sg = sg_next(sg);
-			i++;
-			dma_len = ib_sg_dma_len(ibdev, sg);
-			total_sz += dma_len;
-			last_addr = ib_sg_dma_address(ibdev, sg) + dma_len;
-			end_aligned = !(last_addr & ~MASK_4K);
+		/* collect page fragments until aligned or end of SG list */
+		if (!IS_4K_ALIGNED(end_addr) && i < last_ent) {
+			new_chunk = 0;
+			continue;
 		}
+		new_chunk = 1;
-		/* handle the 1st page in the 1st DMA element */
-		if (cur_page == 0) {
-			page = first_addr & MASK_4K;
-			page_vec->pages[cur_page] = page;
-			cur_page++;
+		/* address of the first page in the contiguous chunk;
+		   masking relevant for the very first SG entry,
+		   which might be unaligned */
+		page = chunk_start & MASK_4K;
+		do {
+			page_vec->pages[cur_page++] = page;
 			page += SIZE_4K;
-		} else
-			page = first_addr;
-
-		for (; page < last_addr; page += SIZE_4K) {
-			page_vec->pages[cur_page] = page;
-			cur_page++;
-		}
-
+		} while (page < end_addr);
 	}
+
 	page_vec->data_size = total_sz;
 	iser_dbg("page_vec->data_size:%d cur_page %d\n", page_vec->data_size,cur_page);
 	return cur_page;
 }
-#define IS_4K_ALIGNED(addr)	((((unsigned long)addr) & ~MASK_4K) == 0)
 /**
  * iser_data_buf_aligned_len - Tries to determine the maximal correctly aligned
@@ -284,42 +276,40 @@ static int iser_sg_to_page_vec(struct is
  * the number of entries which are aligned correctly. Supports the case where
  * consecutive
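To see what the rewritten iser_sg_to_page_vec() produces, here is a user-space sketch of the same chunking loop over plain (address, length) pairs instead of a DMA-mapped scatterlist. Assumptions: `struct dma_ent` and `sg_to_page_vec` are hypothetical stand-ins, MASK_4K follows the iser convention `~(SIZE_4K - 1)`, and the first-entry offset is taken from the DMA address rather than `sg->offset`. Like the patch, it relies on iser_data_buf_aligned_len() having already verified that every chunk after the first starts on a 4K boundary.

```c
#include <stdint.h>

#define SHIFT_4K 12
#define SIZE_4K  (1ULL << SHIFT_4K)
#define MASK_4K  (~(SIZE_4K - 1))
#define IS_4K_ALIGNED(addr) ((((uint64_t)(addr)) & ~MASK_4K) == 0)

/* Hypothetical stand-in for one DMA-mapped SG entry. */
struct dma_ent {
	uint64_t addr;
	uint32_t len;
};

/* Writes the 4K page addresses covered by ents[] into pages[] and returns
 * how many were written; *offset gets the in-page offset of the first entry
 * and *total the byte count of the whole list. */
static int sg_to_page_vec(const struct dma_ent *ents, int nents,
			  uint64_t *pages, uint64_t *offset, uint64_t *total)
{
	uint64_t chunk_start = 0, page;
	int i, cur_page = 0, new_chunk = 1, last_ent = nents - 1;

	*offset = ents[0].addr & ~MASK_4K;
	*total = 0;
	for (i = 0; i < nents; i++) {
		uint64_t start = ents[i].addr;
		uint64_t end = start + ents[i].len;

		if (new_chunk)
			chunk_start = start;
		*total += ents[i].len;

		/* keep collecting fragments until a chunk ends on a 4K
		 * boundary or the list runs out */
		if (!IS_4K_ALIGNED(end) && i < last_ent) {
			new_chunk = 0;
			continue;
		}
		new_chunk = 1;

		/* emit every 4K page covered by [chunk_start, end); masking
		 * only matters for an unaligned first entry */
		page = chunk_start & MASK_4K;
		do {
			pages[cur_page++] = page;
			page += SIZE_4K;
		} while (page < end);
	}
	return cur_page;
}
```

For entries {0x1000, 0x1000}, {0x2000, 0x800}, {0x2800, 0x800}, the second and third entries merge into one chunk, so the result is two pages, 0x1000 and 0x2000, with a total of 0x2000 bytes.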
Re: [PATCH v3] [RFC] rdma/cm: support option to allow manually setting IB path
Sean Hefty wrote: Jason and Or, does this seem ready to queue for 2.6.33? Roland, I missed your email last week. Anyway, as I wrote Sean earlier, I'm totally fine with this patch allowing user space to set a path record for the kernel. Or.
Re: [PATCH v4] rdma/cm: support option to allow manually setting IB path
Sean Hefty wrote: Future changes to the rdma cm can expand on this framework to support the full range of features allowed by the IB CM, such as separate forward and reverse paths and APM. Sean, before enhancing the rdma-cm to support the full feature set of the IB CM, something I personally don't see the actual need for (but I will be happy to get educated on what applications will or can migrate to the rdma-cm once this is implemented), how about trying to allow for a reduced QoS scheme also when the entity that resolved this path didn't consult the SA? IB QoS is based on the query providing the tuple and the SA returning a QoS tuple. Now I'd like to see how we can let the application / querying middleware take advantage of the knowledge of which partition it runs on and use the SL associated with the IPv4 (e.g. AF_INET rdma-cm IDs) IPoIB broadcast group. This way, one can still program a QoS scheme at the SA that is based on partitions. Looking at mckey, the user space code (e.g. ACM) could just do rdma_bind to an IP address of an IPoIB NIC that uses this partition, then rdma_join to an unmapped multicast address that corresponds to the broadcast group, take the SL, and leave the group. Makes sense? Or.
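Or's suggestion relies on the partition's IPoIB broadcast group having a well-known MGID that can be joined just to read its SL. Here is a sketch of how that MGID is derived from the P_Key and scope, following the broadcast-GID layout of RFC 4391 as I understand it (ff1<scope>:401b:<P_Key>::ffff:ffff), with the full-membership bit forced on as, to my recollection, the Linux IPoIB driver does for its own P_Key. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Build the IPoIB IPv4 broadcast MGID for a partition:
 * ff1<scope>:401b:<pkey>:0000:0000:0000:ffff:ffff */
static void ipoib_bcast_mgid(uint16_t pkey, uint8_t scope, uint8_t mgid[16])
{
	pkey |= 0x8000;                  /* full-membership bit */
	memset(mgid, 0, 16);
	mgid[0] = 0xff;
	mgid[1] = (uint8_t)(0x10 | (scope & 0x0f));
	mgid[2] = 0x40;                  /* IPv4 signature 0x401b */
	mgid[3] = 0x1b;
	mgid[4] = (uint8_t)(pkey >> 8);
	mgid[5] = (uint8_t)(pkey & 0xff);
	memset(mgid + 12, 0xff, 4);      /* broadcast: all 32 low bits set */
}
```

With the default partition (P_Key 0x7fff or 0xffff) and link-local scope 2 this yields ff12:401b:ffff::ffff:ffff, the group whose SL the querying code would pick up.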
[PATCH] librdmacm/mckey: enforce local binding for unmapped multicast addresses
Enforce that a local binding is specified for unmapped multicast addresses; otherwise mckey crashes when attempting to use the cma_id->verbs pointer in the port query verb.

Signed-off-by: Or Gerlitz

Sean, using unmapped multicast addresses I see that a different broadcast group is created by the SM, such that mckey doesn't manage to join the IPv4 broadcast group:

$ ./mckey -M ff12:401b::0:0:0:: -b 10.10.5.62 -p 0x2
mckey: joined dgid: ff12:401b::: mlid c00b sl 0

Looking in the SA, I see that the MGID used by the rdma-cm is a bit different from the one used by IPoIB, since the former uses/sets only the lower 28 bits where the latter sets the lower 32 bits for this MGID. Any idea what can be done here?

$ saquery $THIS_NODE_LID
MCMemberRecord group dump:
MGIDff12:401b::::
Mlid0xC000
Mtu.0x84
pkey0x
Rate0x83
SL..0x0
MCMemberRecord group dump:
MGIDff12:401b:::fff:
Mlid0xC00B
Mtu.0x84
pkey0x
Rate0x83
SL..0x0

Index: librdmacm/examples/mckey.c
===================================================================
--- librdmacm.orig/examples/mckey.c
+++ librdmacm/examples/mckey.c
@@ -273,7 +273,7 @@ static int join_handler(struct cmatest_n
 	char buf[40];
 	inet_ntop(AF_INET6, param->ah_attr.grh.dgid.raw, buf, 40);
-	printf("mckey: joined dgid: %s\n", buf);
+	printf("mckey: joined dgid: %s mlid %x sl %d\n", buf, param->ah_attr.dlid, param->ah_attr.sl);
 	node->remote_qpn = param->qp_num;
 	node->remote_qkey = param->qkey;
@@ -556,6 +556,11 @@ int main(int argc, char **argv)
 		}
 	}
+	if (unmapped_addr && !src_addr) {
+		printf("unmapped multicast address requires binding to source address\n");
+		exit(1);
+	}
+
 	test.dst_addr = (struct sockaddr *) &test.dst_in;
 	test.connects_left = connections;
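The 28-bit versus 32-bit difference Or observes can be reproduced with a sketch of the IPv4-to-MGID mapping, modeled on the Linux ip_ib_mc_map() helper as I recall it: 0x401b signature, P_Key copied from the broadcast group, and only the low 28 bits of the IPv4 group address kept. Mapping 255.255.255.255 this way ends in 0fff:ffff, while the IPoIB broadcast MGID ends in ffff:ffff, so the two never name the same group. The function name is hypothetical, and the sketch ignores any port-space-specific adjustments the rdma-cm may apply on top.

```c
#include <stdint.h>
#include <string.h>

/* Map an IPv4 multicast group (host byte order) to an IB MGID:
 * ff1<scope>:401b:<pkey>:0000:0000:0000:0XXX:XXXX */
static void ipv4_mc_to_mgid(uint32_t group, uint16_t pkey, uint8_t scope,
			    uint8_t mgid[16])
{
	memset(mgid, 0, 16);
	mgid[0] = 0xff;
	mgid[1] = (uint8_t)(0x10 | (scope & 0x0f));
	mgid[2] = 0x40;                              /* IPv4 signature */
	mgid[3] = 0x1b;
	mgid[4] = (uint8_t)(pkey >> 8);
	mgid[5] = (uint8_t)(pkey & 0xff);
	mgid[12] = (uint8_t)((group >> 24) & 0x0f);  /* top 4 bits dropped */
	mgid[13] = (uint8_t)(group >> 16);
	mgid[14] = (uint8_t)(group >> 8);
	mgid[15] = (uint8_t)group;
}
```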
Re: Crash in bonding
Pradeep Satyanarayana wrote: This crash was originally reported against RHEL5.4. However, one can recreate this crash quite easily in OFED-1.5 too. I understand that you get the crash when working with the RHEL5.4 bonding driver, correct? Does it happen only with IPoIB devices acting as the bonding slaves, or also with Ethernet devices? Please note that with RHEL 5.4 there's no need to use the OFED-provided bonding module; moreover, I believe that the distro-provided one is more stable and up to date in this case. Moving forward, OFED bonding support for newish distributions is to be removed. Moni, any reason to support bonding/EL 5.4 in OFED? Or.
The steps to recreate the crash are as follows:
1. Run traffic (I used ping) on the IB interfaces through the bond master
2. ifdown ib0
3. ifdown ib1
4. modprobe -r ib_ipoib
Re: [PATCH] librdmacm/mckey: enforce local binding for unmapped multicast addresses
Sean Hefty wrote: Unmapped multicast groups only support the case where the SA has created the group with the MGID undefined. The MGID must be in this format: 0xff1 scope 0xA01B (see figure 196 on page 928 of the spec). The kernel checks for this specific address format to see if it needs to convert the address or not [...] wanted the ability to create a group and get back a unique group ID. I am still not sure I follow you. My basic thought was that unmapped multicast addresses are MGIDs specified by the application, such that the rdma-cm doesn't treat them as IPv6 multicast addresses and no mapping is applied to them. From the spec location you pointed me to, I understand that the intention is for a request to the SA to generate a unique MGID:
1. "if SA receives a request to create a multicast group with the MGID undefined"
2. "the MGID that it creates shall be of the following format"
So there are two parts here: 1st, requesting the SA to create a new group and assign it an MGID (what about joining this node/port to the group?); 2nd, getting back the MGID created by the SA. Looking at the rdma-cm kernel code, I don't see where/how it specifies to the SA that the MGID is undefined. Shouldn't it leave the MGID bit in the component mask unset in this case? Next, I don't see where the MGID created by the SA is given back to the application. I guess I am still missing something here; can you clarify? Thanks, Or.
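The format check Sean describes can be sketched as a mask on the first 32 bits of the MGID: 0xff1, any scope nibble, then 0xa01b. The mask value is my assumption of how such a check would be written, not a quote of the kernel source, and the function name is hypothetical. Under it, Or's ff12:401b:... address is not an SA-assigned MGID, so the kernel would treat it as an address to convert rather than pass it through unmapped.

```c
#include <stdint.h>

/* Nonzero if the MGID starts with 0xff1<scope>a01b, the SA-assigned format. */
static int is_sa_assigned_mgid(const uint8_t mgid[16])
{
	uint32_t w = ((uint32_t)mgid[0] << 24) | ((uint32_t)mgid[1] << 16) |
		     ((uint32_t)mgid[2] << 8) | (uint32_t)mgid[3];

	/* ignore only the scope nibble */
	return (w & 0xfff0ffffu) == 0xff10a01bu;
}
```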
Re: [PATCH 19/25] mlx4: Randomizing mac addresses for slaves
On Wed, Nov 4, 2009 at 10:04 PM, Roland Dreier wrote:
>> +#define MLX4_MAC_HEAD 0x2c900ULL
> Is this a good idea? You're basically choosing 24 random bits within your
> OUI...
> seems the chance of collision with another MAC used on the same network is
> high enough that it could easily happen in practice on a moderately big
> network.
Yes, this was brought up by Stephen and others on this list back on September 11th this year @ http://marc.info/?l=linux-netdev&m=125263488409128
> Can you pick a reserved range or something?
Using a different OUI for the VF devices wouldn't help either, I think, since the number of VFs becomes fairly big even on a modest-size cluster with (say) a VM consuming a VF per 1-2 cores. Or.
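Roland's concern can be put in numbers with the birthday problem. Assuming each VF MAC carries 24 uniformly random bits under a fixed prefix (his reading of MLX4_MAC_HEAD), the sketch below computes the exact probability that at least two of n MACs collide. The function name is illustrative.

```c
/* Probability that at least two of n MACs collide when each one draws
 * `bits` uniformly random bits (birthday problem, exact product form). */
static double mac_collision_prob(unsigned long n, unsigned bits)
{
	double space = (double)(1ULL << bits);
	double p_unique = 1.0;
	unsigned long i;

	for (i = 0; i < n; i++)
		p_unique *= (space - (double)i) / space;
	return 1.0 - p_unique;
}
```

With 24 bits this gives about a 3% chance of a duplicate among 1,000 VFs and about 52% among 5,000, which supports both Roland's worry and Or's point that the VF count on a modest cluster is what drives the risk.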
Re: [PATCH] librdmacm/mckey: enforce local binding for unmapped multicast addresses
Sean Hefty wrote: I merged this with your other patch to mckey and applied them to my tree. I don't see this @ http://www.openfabrics.org/git/?p=~shefty/librdmacm.git; were you referring to a local clone? Or.
Re: QoS in local SA entity
Sean Hefty wrote: I wasn't trying to limit how the SA could 'distribute' QoS information to the end nodes. ACM will obtain QoS information from the SA when it joins its multicast groups. Excellent... still, this depends on how the ACM MGIDs are constructed; I'll take a look at the code. ACM is intended to be a service that's used by the librdmacm to resolve address mappings and routes. Trying to have ACM use the librdmacm ends up with a circular dependency. That's the part I'm trying to avoid. Fair enough. I believe my suggestion is doable without a circular dependency too, e.g. as you indicated below, or with a fairly small enhancement to librdmacm; see next. ACM uses address mappings as defined in an address configuration file (IP -> device, port, pkey). The address file can be created using the provided ib_acme utility, which uses the current system configuration (in an ugly way, but it works). I think this provides QoS behavior similar to what you're describing. I assume you are referring to an IP local to the system ACM runs on, correct? This would work well for applications calling rdma_bind and/or rdma_resolve_addr while specifying a source address. To also support applications that do neither of these two, that is, call rdma_resolve_addr with a dest address only, I suggest enhancing the librdmacm-calling-ACM flow to resolve the source address using a route lookup from user space; next, librdmacm can issue rdma_bind on behalf of this ID, and you have the triplet at hand, so the ACM call can now be made from librdmacm. Writing this, I realized that this had better (should) be done also for apps calling rdma_resolve_addr with a src IP specified. This way you have a unified flow for ACM use in librdmacm for any of apps A, B, C below:
A.1 rdma_bind(src=X)
A.2 rdma_resolve_addr(src=null, dst=Y)
B.1 rdma_resolve_addr(src=null, dst=Y)
C.1 rdma_resolve_addr(src=X, dst=Y)
where the librdmacm-calling-ACM flow is:
L1. compute source address
L2. issue kernel rdma_bind to source address and resolve pkey>
L3. issue ACM address (DGID) resolution call using (pkey>, dest-ip)
Makes sense? If yes, what's the need for the address configuration file? Or.
Re: QoS in local SA entity
Jason Gunthorpe wrote: The entire point of rdma_getaddrinfo + AF_IB is to avoid hacking up librdmacm for every address lookup/cache scheme someone invents. The entire simple point I am trying to make is that rdma_getaddrinfo + AF_INET is doable, is simple, and is needed to keep up the essence of the rdma-cm. I don't see how AF_IB buys anything for anyone, but if you want to push it, go and add your bits, as long as AF_INET remains the first and foremost supported/interoperable flow, present and future. As you indicated, the route lookup I was mentioning could be done in rdma_addrinfo, sure, with &res including both source and destination addresses. No rdma_resolve_addr2 is needed; the one that exists now has source addresses specified. I don't see that extra info is needed for AF_INET that was resolved with rdma_getaddrinfo; is this AF_IB specific? I don't see why the app should bother calling rdma_getaddrinfo; it can be done by librdmacm, with rdma_getaddrinfo having multiple modules as you suggested. I am in favor of the approach suggested by Sean, of librdmacm either doing its native flow or, under an environment variable, doing an alternative flow, where your suggestion not to have the 2nd flow tightly coupled with ACM, e.g. through using the getaddrinfo abstraction and friends, makes sense (yes!) Or.
Re: [PATCH RESEND] ib/iser: re-write SG handling for rdma logic
This patch re-writes the logic that does the above, to make it clearer and simpler. It also fixes a bug in the aligned-for-RDMA check, where only the "end" of each entry was checked and not its "start". Roland, I don't see this patch in your for-next branch; any reason not to merge it? Or.
Re: QoS in local SA entity
Sean Hefty wrote: [...] The current implementation of ACM converts this to:
** Source sends a multicast request to destination IP
** Destination sends a response with IP to DGID mapping
- Path record is constructed from multicast group information
ACM needs to know what the local addresses are, so it can respond to requests for those addresses.
Okay, got it. Still, what do you think of my suggestion for the unified/modified librdmacm flow (L1/L2/L3 in my email), which would be taken when working against a "DGID/Route" provider such as ACM? Or.
Re: QoS in local SA entity
Jason Gunthorpe wrote: The extra info in rdma_resolve_addr2 carries the IB-specific path information from the rdma_getaddrinfo module to the kernel for the address pair. The entire purpose of AF_IB is to let user space tell the kernel it does not want a kernel-side ND and PR query; instead, user space will provide all the information. The kernel patches posted by Sean replace the ND/PR flow with a two-step process: first specifying a DGID to the kernel, next specifying a PATH. My suggestion is to have a librdmacm-initiated bind before sending the DGID to the kernel; this way AF_INET would be supported perfectly, with the slight limitation that the source address port, pkey> tuple would be chosen by route lookup and not by the neigh->dev that was resolved by the kernel ND. This applies only when the modified flow of librdmacm is taken (e.g. under user specification with an environment variable, etc.). --If-- on top of that you want to add AF_IB, we may be able to do that, but I don't see why the whole thing should be made for AF_IB only. Think of it this way: ACM takes over the entire process of what AF_INET does in the kernel. AF_INET talks directly to the IB CM module in the kernel. Thus, it also makes sense that ACM would need to talk to IB CM directly as well. AF_IB is that direct connection. I don't agree that we must state it this way. I see ACM as an alternative way for AF_INET to resolve ND/PR. Or.
Re: LID reconfiguration
> One more question; I saw librdmacm which looked nice but it does not > support multi-path connections. It would eliminate a lot of code if we > could use this. What are your needs? Or.
Re: LID reconfiguration
Jeff Roberson wrote: > I would want a way to specify the alternate sockaddr with automatic > failover between them. Perhaps with some notification when a failover occurred. From your description I still don't see what the alternate address buys you. As was suggested here, bond two IPoIB devices, use the address of the bond in your librdmacm-based app, and you get automatic HA. You get indications on failover through RDMA_CM_EVENT_ADDR_CHANGE; see rdma_get_cm_event(3). Or.
Re: Crash in bonding
Pradeep Satyanarayana wrote: > The crash is specific to IPoIB, and does not happen with Ethernet slaves. Okay. > Can you explain why you plan to remove this from the newer distros? This is > indeed news to me. We plan to remove bonding from --ofed-- because the distro-provided bonding supports IPoIB, simple as that. What isn't clear here? Or.
Re: [PATCH RDMA] Fixup IPv6 support and IPv4 routing corner cases for RDMA CM
Sean Hefty wrote: I'll compare my final patches against the ones submitted by David to see if anything got missed. Are Jason's patches a superset of David's patches, or do they need to be applied first, and only then can David's work be re-reviewed/merged, etc.? Or.
[PATCH] librdmacm/mckey: add notifications on events
add notifications on multicast error and address change events which can take place while traffic is running.
Signed-off-by: Or Gerlitz
Index: librdmacm/examples/mckey.c
===
--- librdmacm.orig/examples/mckey.c
+++ librdmacm/examples/mckey.c
@@ -62,6 +62,7 @@ struct cmatest_node {
 
 struct cmatest {
 	struct rdma_event_channel *channel;
+	pthread_t cmathread;
 	struct cmatest_node *nodes;
 	int conn_index;
 	int connects_left;
@@ -319,6 +320,30 @@ static int cma_handler(struct rdma_cm_id
 	return ret;
 }
 
+static void *cma_thread(void *arg)
+{
+	struct rdma_cm_event *event;
+	int ret;
+
+	while (1) {
+		ret = rdma_get_cm_event(test.channel, &event);
+		if (ret) {
+			perror("rdma_get_cm_event");
+			exit(ret);
+		}
+		switch (event->event) {
+		case RDMA_CM_EVENT_MULTICAST_ERROR:
+		case RDMA_CM_EVENT_ADDR_CHANGE:
+			printf("mckey: event: %s, status: %d\n",
+			       rdma_event_str(event->event), event->status);
+			break;
+		default:
+			break;
+		}
+		rdma_ack_cm_event(event);
+	}
+}
+
 static void destroy_node(struct cmatest_node *node)
 {
 	if (!node->cma_id)
@@ -475,6 +500,7 @@ static int run(void)
 	if (ret)
 		goto out;
 
+	pthread_create(&test.cmathread, NULL, cma_thread, NULL);
 	/*
 	 * Pause to give SM chance to configure switches. We don't want to
 	 * handle reliability issue in this simple test program.
Re: ipath now and then (was [PATCH] IB/core: export struct ib_port)
On Wed, Nov 11, 2009 at 11:06 PM, Dave Olson wrote: > And yes, the ib_ipath is being fully deprecated. The "full set" of > patches that adds ib_qib upstream will include a subset that drops > ib_ipath. All the bug fixes and feature work have been done for ib_qib. It was brought up on a few occasions that the ipath driver could be changed so that it becomes a software IBoE driver (e.g. using a packet socket with the IBoE ethertype for the IB L2 emulation). If it no longer has to serve the QLogic HCA, this transformation might be even easier. I wonder whether it's better to remove it now and maybe bring it back later with the new facelift, or to leave it until the change is done. Or.
Re: [PATCH] librdmacm/mckey: add notifications on events
Sean Hefty wrote: > mckey is intended to be a fairly simple send/receive multicast test program. > What's the reasoning behind adding the event handling? The librdmacm examples serve multiple purposes, among them educating users on how to write rdmacm-based apps and acting as a vehicle to test/validate/reproduce features/bugs/issues. For example, a fellow programmer said she wasn't sure her application would get a multicast error event when a port goes down; with my patch to mckey we were able to see that this event is generated, and we can now do better testing. In the future mckey can be further enhanced to rejoin, etc. on either of the events. Makes sense? Or.
Re: ipath now and then (was [PATCH] IB/core: export struct ib_port)
Ralph Campbell wrote: > I don't understand what you are suggesting. > The kernel module name ib_ipath and/or directory name > drivers/infiniband/hw/ipath could be reused for some > other purpose certainly. On second thought, it's better that you go ahead and remove the hw/ipath directory. I assume the qib code could be made to serve software IBoE in the same manner ipath can; just make sure to keep the IB L2 handling in separate files from the L3/L4 ones... Or
Re: [PATCHv2] infiniband-diags/ibqueryerrors: Add support for PortXmitDiscardDetails
Sasha Khapyorsky wrote: I don't think this is the forum to discuss vendor bugs. There is no way we can commit a fix here for an undocumented bug. Or.
Re: [PATCH RESEND] ib/iser: re-write SG handling for rdma logic
Roland Dreier wrote: > I just haven't been in a merging mode lately... will start working on my > 2.6.33 queue soon. So, roughly when is this work going to start? It seems there are a bunch of things on the plate for this cycle. Or.
Re: [PATCH 8/9] ib/addr: simplify resolving IPv4 addresses
Sean Hefty wrote: > Merge resolve local/remote address resolution into a single > data flow to ensure consistent access and use of the local routing tables. Sean, I reviewed patches 1-6 & 8 and they all look fine; I will give the whole series a try later this week to further validate them. > Based on work from: > David Wilder > Jason Gunthorpe David, Jason, are you planning to test these patches as well? Specifically, I assume the IPv6 work should be of interest to you... Or.
Re: [ewg] [PATCHv6 0/10] RDMAoE support
Eli Cohen wrote: This new series reflects changes based on feedback from the community on the previous set of patches, and is tagged v6. Previous series were posted to the openfabrics general list only. Changes from v5: 1. Bug fixes. How do you expect a reviewer to learn what the bugs were, what the fixes are, and whether there are known bugs that haven't been fixed yet? Is one expected to diff the patch sets? Where is the listing of changes from vX for X=1,2,3,4? Or.
Re: [PATCH 8/9] ib/addr: simplify resolving IPv4 addresses
> I reviewed patches 1-6 & 8 and they all look fine, I will give the whole series > a try later this week to further validate them. I tested the patch series (V2 for the patches that have it, V1 for the rest) over 2.6.32-rc5 and librdmacm-1.0.8-1.el5, covering AF_INET/PS_TCP unicast, AF_INET/PS_IPOIB multicast, and bonding (operability and address-change event). I used mckey and rping, and all worked fine. Thanks for driving this change set, Sean. David, I'll be happy to hear how the IPv6 testing went; let's get this going. Or.
Re: [PATCH 7/9] rdma/cm: fix loopback address support
Sean Hefty wrote: > I will create a new librdmacm package that corresponds with the changes. I did all my testing of the patch set with librdmacm 1.0.10 and a patched 2.6.32-rc5 kernel, where, as I wrote you, I was focusing on AF_INET/PS_TCP and AF_INET/PS_IPOIB. I understand that Dave was covering AF_INET6/PS_TCP with plenty of the IPv6 variations. So what will this new librdmacm package let us cover that wasn't possible so far? Do you mean IPv6 support in mckey? Anything else? Or.
Re: [PATCH 7/9] rdma/cm: fix loopback address support
> Changes were your changes to mckey, plus changes Dave added to cmatose to > support IPv6. The actual library itself hasn't been modified. Okay, got it. I was under the impression that mckey still lacks an option to take from the user an IPv6 multicast address that is neither all zeros nor unmapped, correct? Or will the -m option work with both IPv4 and IPv6 addresses? Or.
Re: RDMAoE verbs questions
Jeff Squyres wrote: I was reviewing Mellanox's Open MPI patches for RDMAoE support. Hi Jeff, can you send us a pointer to the patch series (mail thread or some repository where they sit)? 1. It looks like there is a new field on the ibv_port_attr struct: transport. Is it expected that all device drivers will start filling in this value, or is it done in the OF core code somewhere? Please note that this field isn't present in the distro-provided IB stack, and hence it is highly recommended to avoid referring to it in your code; at least some of us (...) are for decoupling OMPI from OFED, so let's not put sticks in the wheels of that process. The Open MPI RDMAOE patch implies that host loopback is not supported in RDMAOE mode (but it is in IB mode). To be clear, the OMPI code had to do something different for real IB vs. RDMAOE in at least 1 or 2 places. Liran, where does this limitation come from? Doesn't the HCA support bridging (loopback connections) for RDMAoE? If it doesn't, maybe you should add a device capability to mark that. Or.
Re: RDMAoE verbs questions
Jeff Squyres wrote: Here's one thread: http://www.open-mpi.org/community/lists/devel/2009/11/7063.php Jeff, looking at the threads you sent, I didn't find a way to download the patch in a form that can be applied to a source tree; is there a way to do it through this archive? Are these patches available from some git tree @mellanox or elsewhere? Does anyone have the email address of Vasily Philipov (/vasily_at_[hidden]/)? If so, can you or Pasha please ask him to send me (or better, this list) the proposed patch? Many thanks. Or
Re: RDMAoE verbs questions
Pavel Shamis (Pasha) wrote: The patch is attached. Thanks. This patch basically replaces checks that the device transport type is IB with a check that either the former holds or the port transport type is rdmaoe. As Jason, Tziporet and noted, the port transport type seems to be a bad and non-compatible/non-interoperable idea, so it should, and probably could, be avoided. I see another patch @ http://www.open-mpi.org/community/lists/devel/2009/11/7063.php; can you send that one as well? The patch you sent isn't signed, so I can't address the author in further replies (unless you are the author). Also, it wasn't generated with the -p option of diff, which would show the affected function for each change; doing so would help in the review. Or.
Re: RDMAoE verbs questions
Pavel Shamis (Pasha) wrote: > The only reason for this changes is the fact that for IB devices we > prefer to use our own open mpi connection managers. In case if we will > decide to use RDMA-CM for all devices the number of changes will be zero... Whatever; currently this change is still there, and it would be best if you removed it and found another way to set this predicate. > So we decided to use the current ompi code as is, in future maybe we will > implement own ompi rdmacm code that will not have all this work around flows. Just to make sure I am with you: all in all, only one patch is proposed to OMPI for RDMAoE support, namely the patch we discussed above, and this patch does three things:
1. changes BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB to look at the port transport type
2. if the port transport is rdmaoe, doesn't run loopback connections on IB
3. some change in the QP destroy logic
4. that's it...
Correct? Can you comment on #2? Why aren't loopback connections supported? Or.
Re: Reliable IB connections (RC) and event ordering
Roland Dreier wrote: > The IBA takes into account this lack of ordering in multiple places -- > defining "communication established" async events, etc. The same goes for the IB stack... e.g. take a look at the ib_cm_notify and rdma_notify APIs. Or.
Re: RDMAoE verbs questions
Liran Liss wrote: from an rdmacm app's point of view - there is no visible difference between IB and RDMAoE ports: both support the complete set of Verbs, just as any IB transport provider. Wrong: local (loopback) communication isn't supported with RDMAoE. Or.
Re: RDMAoE verbs questions
Paul Grun wrote: Why do you say that, Or? I said that because the latest patch set posted by Mellanox doesn't support loopback. I now hear that this was a temporary limitation which will be removed; let it be. Or.
Re: QoS settings not mapped correctly per pkey ?
Yevgeny Kliteynik wrote: > " It looks like in 'datagram' mode, the SL weights > do not seem to be applied, or maybe this is an > artifact of IPoIB in 'datagram mode' " Yes, there's no reason for connected mode to behave differently w.r.t. QoS/SL assignment from the SM, as both modes get their SL from the path record provided by the SM, and both modes use the same code for the path query... > Have you checked that in this mode you do get the right > SL for each child interface by shutting off the relevant > SL (mapping it to VL15)? Seeing what SL is provided by the SM in response to the path query is trivial, either through the opensm logs or the ipoib ones. E.g. here you see that ib1 got SL 0 on its path to GID fe80::::0008:f104:0399:3c92, LID 0x0006, which is 10.10.0.91:
> ifdown ib1
> echo 1 > /sys/module/ib_ipoib/parameters/debug_level
> ifup ib1
> ping 10.10.0.91
> dmesg | grep ib1
> ib1: Start path record lookup for fe80::::0008:f104:0399:3c92 MTU > 0
> ib1: PathRec LID 0x0006 for GID fe80::::0008:f104:0399:3c92
> ib1: Created ah 81021ddda180
> ib1: created address handle 81021ddda500 for LID 0x0006, SL 0
> # ip neigh show dev ib1
> 10.10.0.91 lladdr 80:00:00:49:fe:80:00:00:00:00:00:00:00:08:f1:04:03:99:3c:92 REACHABLE
Re: InfiniBand/RDMA merge plans
Roland Dreier wrote: > Since 2.6.31-rc8 has been out more than a week already, it's probably > a good time to talk about 2.6.32 merge plans. All the pending things > that I'm aware of are listed below. Hi Roland, any update on the 2.6.33 merge plans? Or.
Re: [PATCH 06/11] RDMA/nes: abnormal listener termination causes loopback node crash
Faisal Latif wrote: when listener is destroyed for loopback connection. Does the upstream iWARP stack support loopback connections? Does this apply to all vendors? Or.
Re: InfiniBand/RDMA merge plans for 2.6.33
Eli Cohen wrote: - IBoE. In principle I think this is starting to get there. Still want to see better ABI compatibility at least, and also make sure the interface chosen works for both rdmacm and non-rdmacm applications. Based on this, I am going to send a new patch set a few days after 2.6.33-rc1 is out. Eli, here are some more issues which should be on the table and which you might want to look at before posting a new version of the patches (or, if you prefer to handle them further down the review process, that's fine):
- loopback support: Liran commented that this works; does this mean only a firmware fix is needed?
- below-the-cover-addr-resolve-in-create-AH flow races, e.g. https://bugs.openfabrics.org/show_bug.cgi?id=1866
- L2 Ethernet integration for rdma-cm based apps, namely, at minimum, have the gang comply with packets sent by the network stack for the same IP route.
Or.
Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)
Liran Liss wrote: >> all the rdmaoe materials saying the lossless traffic class is a must, are you saying that this works well also >> without it? then why from architect point of view you have posed this requirement? lossless traffic can be achieved today using global pause, for example. PFC is still important; we will submit initial patches that support it next week. Liran, I would say that on the one hand global pause isn't the way to go, and on the other hand IB RC functions quite badly when many packets are lost. As such, RDMAoE without PFC and without mapping priorities into TCs (the Ethernet VLs) isn't really fit for production in any non-trivial environment involving more than one hop. Also, this email is from one month ago; any news on the patches? Yevgeny, I took a look, and there are patches to support PFC for the mlx4_en driver, but they were never submitted upstream, which means that even if rdmaoe goes upstream, mainline users will not even be able to really test it. Also, the PFC configuration in these patches seems to be done with sysfs and not through the Netlink APIs defined in include/net/dcbnl.h; did you have any specific reason not to integrate with the mainline method of PFC/TC configuration? Or.
[PATCH] IB/mlx4: fix post_recv wq overflow check
the post recv flow should check wq overflow using the recv and not the send cq
Signed-off-by: Or Gerlitz
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 989555c..2a97c96 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1752,7 +1752,7 @@ int mlx4_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 	ind = qp->rq.head & (qp->rq.wqe_cnt - 1);
 
 	for (nreq = 0; wr; ++nreq, wr = wr->next) {
-		if (mlx4_wq_overflow(&qp->rq, nreq, qp->ibqp.send_cq)) {
+		if (mlx4_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) {
 			err = -ENOMEM;
 			*bad_wr = wr;
 			goto out;
Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)
Roland Dreier wrote: > I agree that implementing DCB is important for IBoE, but why do you say > that a classical ethernet fabric with global pause isn't usable? That > should be roughly equivalent to an IB fabric that uses only a single VL, > which is the case for many production IB fabrics. To start with, no matter how many data VLs are used (e.g. one), all the crucial management traffic (SMPs) goes on VL15, which is on the one hand lossy and on the other hand not subject to congestion when other VLs are. Now how would you manage your Cisco switch --remotely-- on a globally paused fabric when some multicast receiver hasn't had its breakfast and now slows the sender while filling the queues throughout the congestion tree of which this switch is part? To continue, lossless is good, but to make your cluster usable under congestion you need congestion control, that is QCN, which is designed/optimized for the case of multiple TCs. Also, IBoE can potentially find its way into much more complex environments than IB has, specifically clusters whose hosts act as hypervisors running many, many VMs and whose underlying fabric consolidates many types of traffic; globally pausing a port can dramatically reduce the efficiency of such a computing center, which was probably built in the first place to increase efficiency. I believe that the ixgbe team understands that well, and hence their continued DCB efforts can make the combination of RXE with Niantic/ixgbe very interesting to test. Or.
Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)
Paul Grun wrote: > there doesn't appear to be an argument in favor of requiring DCB with RoCEE Interesting. The OFA server is down now, so I don't have access to the OFA IBoE materials, but from memory I recall that in ALL of them you made the IBoE/CEE bundling very clear & evident, e.g. this IBTA presentation made to T11 @ http://www.t11.org/ftp/t11/pub/fc/study/09-543v0.pdf Or.
Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)
Roland Dreier wrote: Sure, DCB is very useful, in many environments. And maybe even a requirement sometimes. I'm simply trying to say that IBoE with classical ethernet is at least as useful as standard IB in many cases. Roland, Paul, putting aside for a moment the detailed discussion we've started and looking at the concluding remarks you made, I'm not sure I follow: if DCB isn't available (even for a silly reason such as HW supporting PFC but the patches not being pushed to the kernel...), what do you think would function better (or function at all) for IBoE, lossy or globally paused Ethernet? I haven't managed so far to convince you that neither is applicable for IBoE, but I also haven't managed to see what you are suggesting in the absence of DCB. Or.
Re: RDMAoE / lossless Ethernet (ewg: SC'09 BOF - Meeting notes)
Liran Liss wrote: I second... Fair enough. So now that (A) everyone agrees that DCB is good for IBoE and (B) mlx4 supports PFC, is there any reason not to push the PFC patches into the kernel and have mlx4_en comply with the mainline dcbnl code? The only way an end-node can cause congestion is if its internal buses don't match the IB link's BW, but this is unrelated to (lack of) transport-level flow control. Thanks for clarifying this. Or.
Re: [PATCH] IB/mlx4: fix post_recv wq overflow check
Roland Dreier wrote: > thanks, applied. Since this is not a regression, I see that it went into your for-next branch, and as such I assume it will be available in 2.6.34. Are you fine with the patch going into the -stable series? Or.
Re: [PATCH] IB/mlx4: fix post_recv wq overflow check
Roland Dreier wrote: Actually I was planning on sending it for 2.6.33, since it's so small and obvious and we're reasonably early in the cycle. Not sure about -stable though -- has this been hit in practice? I agree that it should go into 2.6.33; since it's so small, there's no reason to wait for 2.6.34. As for whether it has been hit: note that without the fix there is both a bug in the overflow check and extra contention between the post-recv and poll-send-CQ flows for ULPs whose send CQ is different from their recv CQ, e.g. IPoIB. I came across this bug while reviewing the mlx4 posting code during some profiling. I wonder if the overflow check could be removed altogether and left to the ULP (the kernel is a trusted environment...); is there any risk in doing so? This way the WR posting code would not contend with the poll WC code on the CQ lock. Or.
Re: [PATCHv7 4/9] ib_core: RoCEE CMA device binding
Eli Cohen wrote:
> +static int cma_resolve_rocee_route(struct rdma_id_private *id_priv)
[...]
> + route->path_rec->hop_limit = 2;
Why? Does this value have any specific meaning?
> + route->path_rec->mtu_selector = 2;
All the xxx_selector usages in this code should be converted to use the selector enum from ib_sa.h.
Or.
[PATCH] ib/ipoib: remove TX moderation from the ethtool related code
As of commit f56bcd8 "IPoIB: Use separate CQ for UD send completions", there are no TX interrupts on the main code path. Change the ethtool related code to comply with this, so that users will not be misled into assuming they can control TX interrupt moderation. Pointed out by Alex Vainman.
Signed-off-by: Or Gerlitz
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index e9795f6..d10b4ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -55,9 +55,7 @@ static int ipoib_get_coalesce(struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	coal->rx_coalesce_usecs = priv->ethtool.coalesce_usecs;
-	coal->tx_coalesce_usecs = priv->ethtool.coalesce_usecs;
 	coal->rx_max_coalesced_frames = priv->ethtool.max_coalesced_frames;
-	coal->tx_max_coalesced_frames = priv->ethtool.max_coalesced_frames;
 
 	return 0;
 }
@@ -69,10 +67,8 @@ static int ipoib_set_coalesce(struct net_device *dev,
 	int ret;
 
 	/*
-	 * Since IPoIB uses a single CQ for both rx and tx, we assume
-	 * that rx params dictate the configuration. These values are
-	 * saved in the private data and returned when ipoib_get_coalesce()
-	 * is called.
+	 * These values are saved in the private data and returned
+	 * when ipoib_get_coalesce() is called
 	 */
 	if (coal->rx_coalesce_usecs > 0x ||
 	    coal->rx_max_coalesced_frames > 0x)
@@ -85,8 +81,6 @@ static int ipoib_set_coalesce(struct net_device *dev,
 		return ret;
 	}
 
-	coal->tx_coalesce_usecs = coal->rx_coalesce_usecs;
-	coal->tx_max_coalesced_frames = coal->rx_max_coalesced_frames;
 	priv->ethtool.coalesce_usecs = coal->rx_coalesce_usecs;
 	priv->ethtool.max_coalesced_frames = coal->rx_max_coalesced_frames;
upstream mlx4/ib/4K mtu support
Hi Vlad, I came across this OFED patch which isn't upstream. Is it required to get 4K MTU working with mlx4/IB? Was it rejected from upstream? If so, why? Or.
mlx4/IB: Add set_4k_mtu module parameter.
It controls the InfiniBand link MTU for all IB ports in a host.
Signed-off-by: Vladimir Sokolovsky
---
Index: ofed_kernel-fixes/drivers/net/mlx4/port.c
===
--- ofed_kernel-fixes.orig/drivers/net/mlx4/port.c 2009-11-09 02:20:06.0 +0200
+++ ofed_kernel-fixes/drivers/net/mlx4/port.c 2009-11-09 02:21:46.0 +0200
@@ -37,6 +37,10 @@
 
 #include "mlx4.h"
 
+int mlx4_ib_set_4k_mtu = 0;
+module_param_named(set_4k_mtu, mlx4_ib_set_4k_mtu, int, 0444);
+MODULE_PARM_DESC(set_4k_mtu, "attempt to set 4K MTU to all ConnectX ports");
+
 #define MLX4_MAC_VALID (1ull << 63)
 #define MLX4_MAC_MASK 0xULL
 
@@ -308,6 +312,9 @@
 
 	memset(mailbox->buf, 0, 256);
 
+	if (mlx4_ib_set_4k_mtu)
+		((__be32 *) mailbox->buf)[0] |= cpu_to_be32((1 << 22) | (1 << 21) | (5 << 12) | (2 << 4));
+
 	((__be32 *) mailbox->buf)[1] = dev->caps.ib_port_def_cap[port];
 	err = mlx4_cmd(dev, mailbox->dma, port, 0, MLX4_CMD_SET_PORT,
 		       MLX4_CMD_TIME_CLASS_B);
Re: RDMA Read sge errors
Jack, I see that commit cd155c1 "IB/mlx4: Fix creation of kernel QP with max number of send s/g entries" is in mainline but not in OFED 1.4.x, and that mlx4_0090_fix_sq_wrs.patch (below) is in OFED but not in mainline. Was it rejected from the mainline kernel? If so, why?

Or.

1. Limit qp resources accepted for ib_create_qp() to the limits reported in ib_query_device(). In kernel space, make sure that the limits returned to the caller following qp creation also lie within the reported device limits. For userspace, report as before, and do adjustment in libmlx4 (so as not to break ABI).

2. Limit max number of wqes per QP reported when querying the device, so that ib_create_qp will never fail due to any additional headroom WQEs allocated.

Signed-off-by: Jack Morgenstein

---
 drivers/infiniband/hw/mlx4/main.c    |  2 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  7 +++
 drivers/infiniband/hw/mlx4/qp.c      | 25 +++--
 3 files changed, 27 insertions(+), 7 deletions(-)

Index: ofed_kernel/drivers/infiniband/hw/mlx4/main.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/main.c
+++ ofed_kernel/drivers/infiniband/hw/mlx4/main.c
@@ -122,7 +122,7 @@ static int mlx4_ib_query_device(struct i
 	props->max_mr_size = ~0ull;
 	props->page_size_cap = dev->dev->caps.page_size_cap;
 	props->max_qp = dev->dev->caps.num_qps - dev->dev->caps.reserved_qps;
-	props->max_qp_wr = dev->dev->caps.max_wqes;
+	props->max_qp_wr = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE;
 	props->max_sge = min(dev->dev->caps.max_sq_sg,
 			     dev->dev->caps.max_rq_sg);
 	props->max_cq = dev->dev->caps.num_cqs - dev->dev->caps.reserved_cqs;
Index: ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ ofed_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -44,6 +44,13 @@
 #include
 #include
 
+enum {
+	MLX4_IB_SQ_MIN_WQE_SHIFT = 6
+};
+
+#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
+#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))
+
 struct mlx4_ib_ucontext {
 	struct ib_ucontext ibucontext;
 	struct mlx4_uar uar;
Index: ofed_kernel/drivers/infiniband/hw/mlx4/qp.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/qp.c
+++ ofed_kernel/drivers/infiniband/hw/mlx4/qp.c
@@ -289,8 +289,9 @@ static int set_rq_size(struct mlx4_ib_de
 		       int is_user, int has_srq, struct mlx4_ib_qp *qp)
 {
 	/* Sanity check RQ size before proceeding */
-	if (cap->max_recv_wr > dev->dev->caps.max_wqes ||
-	    cap->max_recv_sge > dev->dev->caps.max_rq_sg)
+	if (cap->max_recv_wr > dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE ||
+	    cap->max_recv_sge >
+		min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg))
 		return -EINVAL;
 
 	if (has_srq) {
@@ -309,8 +310,19 @@ static int set_rq_size(struct mlx4_ib_de
 		qp->rq.wqe_shift = ilog2(qp->rq.max_gs * sizeof (struct mlx4_wqe_data_seg));
 	}
 
-	cap->max_recv_wr = qp->rq.max_post = qp->rq.wqe_cnt;
-	cap->max_recv_sge = qp->rq.max_gs;
+	/* leave userspace return values as they were, so as not to break ABI */
+	if (is_user) {
+		cap->max_recv_wr = qp->rq.max_post = qp->rq.wqe_cnt;
+		cap->max_recv_sge = qp->rq.max_gs;
+	} else {
+		cap->max_recv_wr = qp->rq.max_post =
+			min(dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE, qp->rq.wqe_cnt);
+		cap->max_recv_sge = min(qp->rq.max_gs,
+					min(dev->dev->caps.max_sq_sg,
+					    dev->dev->caps.max_rq_sg));
+	}
+	/* We don't support inline sends for kernel QPs (yet) */
+
 	return 0;
 }
 
@@ -321,8 +333,9 @@ static int set_kernel_sq_size(struct mlx
 	int s;
 
 	/* Sanity check SQ size before proceeding */
-	if (cap->max_send_wr > dev->dev->caps.max_wqes ||
-	    cap->max_send_sge> dev->dev->caps.max_sq_sg ||
+	if (cap->max_send_wr > (dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE) ||
+	    cap->max_send_sge>
+		min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg) ||
 	    cap->max_inline_data + send_wqe_overhead(type, qp->flags) +
 	    sizeof (struct mlx4_wqe_inline_seg) > dev->dev->caps.max_sq_desc_sz)
 		return -EINVAL;
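The headroom arithmetic in the patch can be checked in isolation. The sketch below copies the two macros from the mlx4_ib.h hunk and mimics the clamp the patch applies to the reported max_qp_wr and to the sanity checks in set_rq_size()/set_kernel_sq_size(). The helper names and the device limit used with them are hypothetical; this is not mlx4 code.

```c
#include <stdbool.h>

/* Copied from the mlx4_ib.h hunk of the patch above. */
enum { MLX4_IB_SQ_MIN_WQE_SHIFT = 6 };
#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))

/* Hypothetical helper: the max_qp_wr the patched query_device would
 * report for a device whose caps.max_wqes is dev_max_wqes. */
static int reported_max_qp_wr(int dev_max_wqes)
{
	return dev_max_wqes - MLX4_IB_SQ_MAX_SPARE;
}

/* Hypothetical helper mirroring the patched sanity check: a request is
 * accepted only if it fits within the headroom-adjusted limit. */
static bool wr_request_ok(int requested_wr, int dev_max_wqes)
{
	return requested_wr <= reported_max_qp_wr(dev_max_wqes);
}
```

With the smallest WQE size (1 << 6 = 64 bytes) the spare is 2048 / 64 + 1 = 33 WQEs, so a device limit of, say, 16384 WQEs would be reported as 16351. Per the patch description, a caller asking for exactly the reported maximum should then never fail because of the extra headroom WQEs.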
Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space
sebastien dugue wrote:
> That can be done with port numbers, except that we cannot separate
> traffic to Lustre MDS and traffic to Lustre OSS

Looking at these patches, and going along with you for a minute, I don't see how this patch set helps you assign a different QoS level (e.g. SL) to MDS vs. OSS related traffic. Can you elaborate on that a bit?

Sean Hefty wrote:
> Can't this be done using port numbers in the existing port space?

Indeed, Sebastien, what prevents you from using the TCP port space, with one port used for MDS traffic and another port for OSS traffic? How does Lustre get the ports it listens on: are they well known, or do you call bind with port zero and use the port allocated by the rdma-cm?

Or.
Re: [PATCHv7 7/9] ib_core: Add API to support RoCEE from userspace
Eli Cohen wrote:
> Add ib_uverbs_get_mac() to be used by ibv_create_ah() to retrieve the remote
> port's MAC address from the remote port's GID. Port link layer is also returned
> by ibv_query_port()

Why can't all this be implemented within libibverbs? Looking at mlx4's implementation of ib_get_mac, it reduces to calling rdma_get_ll_mac, a two-line inline function which does the translation.

Or.
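To make the "two-line function" point concrete, here is a user-space sketch of such a GID-to-MAC translation. It assumes the RoCE GID is a link-local address built from the MAC with the modified EUI-64 scheme (MAC bytes split around ff:fe in bytes 8..15, universal/local bit flipped), which is what a helper like rdma_get_ll_mac is assumed to reverse. gid_to_mac is a hypothetical name, not a libibverbs function.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space GID -> MAC translation, assuming a link-local
 * GID of the form fe80::<modified EUI-64 of the MAC>. */
static void gid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
	memcpy(mac, &gid[8], 3);      /* first three MAC bytes (before ff:fe) */
	memcpy(mac + 3, &gid[13], 3); /* last three MAC bytes (after ff:fe) */
	mac[0] ^= 0x02;               /* undo the universal/local bit flip */
}
```

For example, the GID fe80::0202:c9ff:fe00:1234 maps back to the MAC 00:02:c9:00:12:34; nothing here needs kernel help, which is the point being made.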
Re: [PATCHv7 4/9] ib_core: RoCEE CMA device binding
Eli Cohen wrote:
> The other place is IPoIB:path_rec_completion() where we need not require
> GRH since IPoIB over RoCEE is disabled

Please note that you can't assume that IPoIB need not use GRH, as at some future point this code may operate across IB subnets. Patches that allow for supporting that have been merged into the code over the last couple of years, e.g. see 46f1b3d7 "IB/ipoib: Use ib_init_ah_from_path to initialize ah_attr".

Or.
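Once IPoIB builds its address handle from the path record, whether a GRH is used is decided by the path, not by IPoIB. A minimal sketch of that decision, assuming the rule ib_init_ah_from_path() applied at the time (request a GRH when the SA-provided hop_limit is greater than 1); the struct and function below are hypothetical simplifications, not kernel types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down path record: only the field that matters here. */
struct path_rec_sketch {
	uint8_t hop_limit; /* 0 or 1: does not cross a router; > 1: routed path */
};

/* Assumed rule: a GRH is needed when the path's hop_limit exceeds 1,
 * i.e. when the destination may sit behind an IB router. */
static bool path_needs_grh(const struct path_rec_sketch *rec)
{
	return rec->hop_limit > 1;
}
```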
clarification on the mlx4 CQE structure
Hi Yevgeny,

Looking at commit f780a9f "mlx4_core: Add ethernet fields to CQE struct", I see the following two changes:

@@ -692,14 +692,13 @@ repoll:
-	wc->sl = cqe->sl >> 4;
+	wc->sl = be16_to_cpu(cqe->sl_vid >> 12);

I wasn't sure if/why a conversion from network order to host order is needed here. Can you clarify that?

Or.

@@ -39,17 +39,18 @@ struct mlx4_cqe {
-	__be32 my_qpn;
+	__be32 vlan_my_qpn;
-	u8 sl;
-	u8 reserved1;
+	__be16 sl_vid;
Re: clarification on the mlx4 CQE structure
Yevgeny Petrilin wrote:
> This commit has an endianness bug, that was fixed in commit f781a22f.
> The cqe->sl_vid field is a be16, so we needed to convert the sl value to
> host order. Before the commit this field was two u8 fields, so no conversion
> was needed

Okay, got it, thanks.

Or.
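Why the order of shift and byte swap matters can be shown in user space, with ntohs() standing in for be16_to_cpu(). The sketch assumes the layout implied by the field name and the ">> 12" in the commit: SL in the top four bits of the big-endian sl_vid, VLAN ID in the low twelve. It illustrates the bug class; it is not the text of f781a22f.

```c
#include <arpa/inet.h> /* ntohs(), htons() */
#include <stdint.h>
#include <string.h>

/* Order used in f780a9f: shift the raw big-endian value, then swap. */
static unsigned sl_shift_then_swap(uint16_t raw)
{
	return ntohs((uint16_t)(raw >> 12));
}

/* Convert to host order first, then extract the top four bits. */
static unsigned sl_swap_then_shift(uint16_t raw)
{
	return ntohs(raw) >> 12;
}

/* Build sl_vid exactly as it would sit in CQE memory (big-endian). */
static uint16_t cqe_sl_vid_raw(unsigned sl, unsigned vid)
{
	uint8_t bytes[2] = { (uint8_t)((sl << 4) | (vid >> 8)), (uint8_t)(vid & 0xff) };
	uint16_t raw;

	memcpy(&raw, bytes, sizeof(raw));
	return raw;
}
```

For SL 5 and VID 0x123, swap-then-shift yields 5 on any host. On a little-endian host shift-then-swap yields 512; on a big-endian host both orders agree, which is how such a bug can go unnoticed.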
Re: [PATCH] IB/mlx4: fix post_recv wq overflow check
Roland Dreier wrote:
> I do think it is quite common to see this WQ overflow check trigger, even for kernel code

mmm, why is that common? Typically there's a higher layer to which the IB ULP advertises some sort of maximal number of credits (e.g. in the SCSI case, iSER and SRP specify the maximal number of commands in the SCSI host template), or the ULP informs a higher layer that no more sends can be done (e.g. IPoIB calling netif_stop_queue once it senses that the QP is full, etc.).

Or.
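A sketch of the credit-style gating described above: the ULP never keeps more sends outstanding than the depth it sized the QP for, and it stops its upper layer (as IPoIB does with netif_stop_queue()) once the budget is used, so the provider's overflow check should never fire. All names are hypothetical, and completion handling is reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical per-QP send accounting kept by a ULP. */
struct ulp_tx {
	unsigned depth;       /* send queue size the QP was created with */
	unsigned outstanding; /* posted but not yet completed */
	bool queue_stopped;   /* analogous to netif_stop_queue() */
};

/* Try to post one send; refuse (and stop the queue) if no credit is left. */
static bool ulp_post_send(struct ulp_tx *tx)
{
	if (tx->outstanding >= tx->depth) {
		tx->queue_stopped = true;
		return false;
	}
	tx->outstanding++;
	if (tx->outstanding == tx->depth)
		tx->queue_stopped = true; /* stop before the next post could overflow */
	return true;
}

/* Called per send completion: return a credit and wake the queue. */
static void ulp_send_completed(struct ulp_tx *tx)
{
	if (tx->outstanding > 0)
		tx->outstanding--;
	if (tx->queue_stopped && tx->outstanding < tx->depth)
		tx->queue_stopped = false; /* analogous to netif_wake_queue() */
}
```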
Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space
sebastien dugue wrote:
>> So I guess you need to change the ports used within the new port space -- but then why can't you just stay in the TCP space but change the ports used?
> No, with the new port space, there's no need to change ports. You only need to specify the target GUIDs. For example:
>   lustre, target-portguid 0x1234,0x1235 : 1 # lustre traffic to MDSs
>   lustre: 2 # default lustre traffic (to OSSs)
> Hope this helps clarify things a bit.

Sorry, but it doesn't. As far as I understand, there are three possibilities for what the string "lustre" is translated to by the opensm QoS logic:

(A) the lustre port in the TCP port space
(B) the lustre port space
(C) nothing (that is, not a service, in the same manner that "ipoib" just doesn't mean anything to opensm)

Assuming C is not the case, either A or B will yield the same result, and as such the new port space buys you nothing.

Or.
Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space
sebastien dugue wrote:
> No, because in OpenSM's QoS logic, there's no way to map the TCP port
> space with specific target GUIDs onto an SL. You have keywords for SDP, SRP,
> RDS, ISER, ... but not for the TCP port space (or am I missing something?).

Going along with this, what prevents you from patching the opensm QoS engine to support the lustre service under the TCP port space and/or a combination of service and target port GUID? All in all, first, I don't see what a kernel patch buys you, and second, if it does buy you something, you should be able to gain the same effect by patching opensm.

Thinking about this a bit more, since the rules are processed in order, wouldn't the following scheme let you achieve the same effect?

  target-portguid 0x1234,0x1235 : 1 # traffic to MDSs
  lustre: 2 # default lustre traffic (to OSSs)

Or.
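The "rules are processed in order" argument is a first-match lookup: with the target-GUID rule ahead of the generic lustre rule, MDS-bound traffic never reaches the second rule. A minimal sketch of that evaluation with hypothetical types; this is not opensm code, and opensm's real matching may differ in detail. A NULL service matches any service, and an empty GUID list matches any target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical QoS rule modeled on the two policy lines above. */
struct qos_rule {
	const char *service;   /* NULL: any service */
	const uint64_t *guids; /* target port GUIDs; n_guids == 0: any target */
	size_t n_guids;
	int sl;
};

static int rule_matches(const struct qos_rule *r, const char *service, uint64_t target)
{
	if (r->service && strcmp(r->service, service) != 0)
		return 0;
	if (r->n_guids == 0)
		return 1;
	for (size_t i = 0; i < r->n_guids; i++)
		if (r->guids[i] == target)
			return 1;
	return 0;
}

/* First matching rule wins; default_sl if none matches. */
static int lookup_sl(const struct qos_rule *rules, size_t n, const char *service,
		     uint64_t target, int default_sl)
{
	for (size_t i = 0; i < n; i++)
		if (rule_matches(&rules[i], service, target))
			return rules[i].sl;
	return default_sl;
}
```

With the two rules from the scheme above, lustre traffic to 0x1234 gets SL 1 and lustre traffic to any other target gets SL 2, with no new port space involved.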
Re: [PATCH] IB/mlx4: fix post_recv wq overflow check
Roland Dreier wrote:
> In other words this check catches common bugs and makes them a gazillion
> times easier to find and fix. So unless the performance impact is extreme,
> I'm inclined to leave it

Okay, let's leave it like that unless someone comes up with performance data showing that this is really a bottleneck.

Or.
Re: [PATCH] ib/ipoib: remove TX moderation from the ethtool related code
Or Gerlitz wrote:
> As of commit f56bcd8 "IPoIB: Use separate CQ for UD send completions", there
> are no TX interrupts in the main code path. Change the ethtool related code
> to comply with this, so that users will not be misled into assuming they can
> control TX interrupt moderation.

Hi Roland, did you have a chance to look at this one?

Or.
Re: rdma_bind failure over iWarp
Woodruff, Robert J wrote:
> [wo...@det-17 src]$ ucmatose -b 192.168.0.17
> cmatose: starting server
> cmatose: bind address failed: No such file or directory
> return status -1

A case where rdma_bind returns -ENOENT was debugged here this week; the problem was the same IP being assigned to two interfaces, one of which was not on an HCA/RNIC. I just tried assigning the same IP to the on-board 1Gb/s interface and to the IB HCA and couldn't hit the ucmatose error (2.6.33-rc4 and librdmacm-1.0.8-5.el5).

Moni, anything you can add?

Or.
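Since the cause found here was one IP address on two interfaces, a quick user-space check is to walk the getifaddrs(3) list and report every interface carrying the address passed to ucmatose -b. A minimal sketch; it only detects the duplicate and says nothing about which interface the rdma_cm ends up picking.

```c
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Print every interface in 'list' carrying the IPv4 address 'addr'
 * (network byte order) and return how many there are. More than one
 * is the suspicious configuration described above. */
static int report_ifaces_with_ipv4(const struct ifaddrs *list, in_addr_t addr)
{
	int count = 0;

	for (const struct ifaddrs *ifa = list; ifa; ifa = ifa->ifa_next) {
		if (!ifa->ifa_addr || ifa->ifa_addr->sa_family != AF_INET)
			continue;
		const struct sockaddr_in *sin = (const struct sockaddr_in *)ifa->ifa_addr;
		if (sin->sin_addr.s_addr == addr) {
			printf("%s\n", ifa->ifa_name);
			count++;
		}
	}
	return count;
}
```

Typical use: call getifaddrs(&list), pass the list together with inet_addr("192.168.0.17"), then release it with freeifaddrs(list).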
Re: [PATCH 1/3] rdma_cm: Add support for a new RDMA_PS_LUSTRE Lustre port space
sebastien dugue wrote:
> OK, then going with the TCP port space, what we need in OpenSM is a
> combination of service id (TCP) _and_ TCP port _and_ target GUID.

I believe you can have a 'lustre' keyword in the opensm QoS parser which stands for the combination of TCP port space + lustre TCP port (maybe it exists already), so in the policy file this would translate to X,{Z1,Z2,..,Zm} (as in your example) and not to X,Y,{Z1,Z2,..,Zm}.

Or.
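"TCP port space + lustre TCP port" already collapses into a single value that opensm could key on. The sketch assumes the rdma_cm encoding of the IB service ID used by cma_get_service_id(), (port_space << 16) + port, with RDMA_PS_TCP = 0x0106 as in rdma_cma.h. The port 987 in the usage is a placeholder, not a claim about which port Lustre uses.

```c
#include <stdint.h>

/* Value of RDMA_PS_TCP in rdma_cma.h (assumed here). */
enum { PS_TCP = 0x0106 };

/* Host-order IB service ID for a port space and a host-order port,
 * assuming the rdma_cm encoding (ps << 16) + port. */
static uint64_t service_id(uint16_t ps, uint16_t port)
{
	return ((uint64_t)ps << 16) + port;
}
```

For example, service_id(PS_TCP, 987) is 0x010603DB, so a 'lustre' keyword could be defined as that one service ID (or a range covering the ports in use), with the target GUID list added on top.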
Re: ibv_asyncwatch and buffering
Håkon Bugge wrote:
> That would make ibv_asyncwatch more useful in scripted environments

Patch?

Or.
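One way such a patch could look, assuming the tool prints its events to stdout through stdio: switch stdout to line buffering before the first event is printed, so each event reaches a pipe or file as soon as its line is complete. setvbuf(3) with _IOLBF is standard C; where exactly the call would go in ibv_asyncwatch is an assumption.

```c
#include <stdio.h>

/* Make stdout line-buffered even when it is a pipe or a file (stdio
 * defaults to full buffering there). Should run before anything is
 * written to stdout. Returns 0 on success. */
static int force_line_buffered_stdout(void)
{
	return setvbuf(stdout, NULL, _IOLBF, 0);
}
```

Calling fflush(stdout) after each printed event would achieve the same effect.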
Re: ib_write_bw hanging when using max max_inline value
Håkon Bugge wrote:
> the test program hangs when exchanging 920 bytes [...] the creation of QP goes OK

Attaching a debugger is typically helpful to see where a program talking directly to the hardware hangs. If it happens on the slow path, strace can be useful as well. Did you look at the actual values set for this QP, that is, as suggested by ibv_create_qp(3), look at the init attributes after the function returns?

Or.
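The ibv_create_qp(3) advice amounts to comparing what was requested with what the provider wrote back into qp_init_attr.cap. A minimal sketch of that comparison; struct qp_cap_sketch mirrors the five fields of libibverbs' struct ibv_qp_cap so that the example builds without the library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the fields of struct ibv_qp_cap (libibverbs). */
struct qp_cap_sketch {
	uint32_t max_send_wr;
	uint32_t max_recv_wr;
	uint32_t max_send_sge;
	uint32_t max_recv_sge;
	uint32_t max_inline_data;
};

/* After ibv_create_qp() returns, the actual capabilities have been
 * written back; each should be at least what was requested. */
static bool caps_cover_request(const struct qp_cap_sketch *req,
			       const struct qp_cap_sketch *got)
{
	return got->max_send_wr >= req->max_send_wr &&
	       got->max_recv_wr >= req->max_recv_wr &&
	       got->max_send_sge >= req->max_send_sge &&
	       got->max_recv_sge >= req->max_recv_sge &&
	       got->max_inline_data >= req->max_inline_data;
}
```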
Re: [PATCH] ib/ipoib: remove TX moderation from the ethtool related code
Roland Dreier wrote:
> Yes, looks fine, planning to merge it for 2.6.34

Okay, good. I see that your for-next branch is updated and already contains one patch.

Or.
Re: [PATCH] libibverbs: Force line-buffering in ibv_asyncwatch
Håkon Bugge wrote:
> I used the information at
> www.openfabrics.org/git/?p=ofed_1_2_5/libibverbs.git;a=summary
> which states the "owner" to be Vlad. Maybe that confused me. I'll send a
> copy to Roland

Roland's user space git trees are all hosted at kernel.org. The libibverbs one is git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git, and you can find the libmlx4 and libmthca ones there as well.

Vlad, is there a way to prevent such confusion in the future, maybe by putting a clear comment in the header of the OFA git page?

Or.
Re: ib_write_bw hanging when using max max_inline value
Håkon Bugge wrote:
> The capabilities in qp_init_attr used as input to ibv_create_qp() are:
> max_send_sge = 1, max_recv_sge = 1, max_inline_data = 928
> Upon return the capabilities are modified to the following
> max_send_sge = 32, max_recv_sge = 1, max_inline_data = 928
> Note decreasing the size of the RDMA to 912 bytes, the program works

Jack, it sounds like this use case hits the bug(s) you were attempting to solve with the patch set we were discussing at http://marc.info/?l=linux-rdma&m=126330119309593, which never made it upstream, correct?

Or.