Re: [PATCH 10/11] IB: only keep a single key in struct ib_mr
> From: Jason Gunthorpe> > >fill mr->key by the lkey or rkey based on that and everything will > > >work fine. > > > > But the ULP *can* register a memory buffer with local and remote > > access permissions. > Not in the new API. > > If a ULP ever comes along that does need that then they can start with > two MRs and then eventually upgrade the kapi to have some kind of > efficient bidir MR mode. ULPs are *already* using the same registrations for both local and remote access. So, currently, this is a no-go. Sorry. Let's make the required adjustments, which don't violate specs, and then continue. > > What we've seen on the list lately is that every single ULP seems to > have technical problems running the stack properly. We need to get off > this idea that the spec has to govern the kapi - that didn't lead us > any place nice. > Oh, please. Our stack connects millions of endpoints in production using multiple ULPs for the last 15 years with full interoperability because of specs. I am sure that we can keep improving our APIs while adhering to industry standards, as we have done very successfully until now. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V2 00/11] Add RoCE v2 support
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- > ow...@vger.kernel.org] On Behalf Of Doug Ledford > These patches add the concept of duplicate GIDs that are differentiated by > their RoCE version (also called network type). So, now, an incoming packet > could match a couple different gid_indexes and we need additional > information to get back to the definitive 1:1 mapping. The submitted patches > are designed around a lazy resolution of the namespace, preferring to defer > the work of mapping the incoming packet to a unique namespace until that > information is actually needed. To enable this lazy resolution, it provides > the > network_type so that the resolution can be done. > > This is a fair assessment of the current state of things and what these > patches do, yes? Hi Doug, Yes, your description nails the crux of the discussion. Thanks. > > Jason's objections are this: > > 1) The lazy resolution is wrong. > 2) The use of network_type as the additional information to get to the > unique namespace is vendor specific cruft that shouldn't be part of the core > kernel API. > Which is totally wrong. RoCE versions and the network type attribute are part of the RoCE specification. There is nothing vendor specific. In fact, this patch initially came from Avago!!! > Jason's preference would be that the above issues be resolved by skipping > the lazy resolution and instead doing proactive resolution on receipt of a > packet and then probably just pass the namespace around instead of passing > around the information needed to resolve the namespace. Or, at a > minimum, at least make the information added to the core API not > something vendor specific like network_type, which is a detail of the > Mellanox implementation. Again, there is nothing here specific to the Mellanox implementation. This argument is invalid. > > Jason, is this accurate for your position? > > If everyone agrees that this is a fair statement of where we stand, then I'll > continue my response. If not, please correct anything I have wrong above > and I'll take that into my continued response. > Since we are following the standard, and: - This is *not* a vendor specific issue - Lazy resolution adheres to the current flow of the stack (which IMO, is the right thing to do anyway) - There are no security issues whatsoever (see below) - The code could be easily extended to support namespaces in the future - We should "release early release often", and not hold progress because of endless philosophical discussions then let's apply these patches. I perfectly understand Jason's proposal and its merits. I just don't agree that it is a better approach, and in any case this is not the time to continue discussing it. > 1 - Actually, for any received packet with associated IP address information. > We've only enabled net namespaces for IP connections between user space > applications, for direct IB connections or for kernel connections there is not > yet any namespace support. Correct. That's why we are not risking anything in the KAPI with regard to namespaces; there is no security issue. --Liran > > -- > Doug Ledford> GPG KeyID: 0E572FDD > N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj"��!�i
RE: [PATCH for-next V2 00/11] Add RoCE v2 support
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- > ow...@vger.kernel.org] On Behalf Of Doug Ledford > In particular, Liran piped up with this comment: > > "Also, I don't want to do any route resolution on the Rx path. A UD QP > completion just reports the details of the packet it received. > > Conceptually, an incoming packet may not even match an SGID index at all. > Maybe, responses should be sent from another device. This should not be > decided that the point that a packet was received." > > The part that bothers me about this is that this statement makes sense when > just thinking about the spec, as you say. However, once you consider > namespaces, security implications make this statement spec compliant, but > still unacceptable. The spec itself is silent on namespaces. But, you guys > wanted, and you got, namespace support. > Since that's beyond spec, and carries security requirements, I think it's > fair to > say that from now on, the Linux kernel RDMA stack can no longer *just* be > spec compliant. There are additional concerns that must always be > addressed with new changes, and those are the namespace constraint > preservation concerns. > Hi Doug, Currently, there is no namespace support for RoCE, so the RoCEv2 patches have *nothing* to do with this. That said, the RoCE specification does not contradict or inhibit any future implementation for namespaces. The CMA will get thefrom ib_wc and resolve to a netdev (or sgid_index->netdev, whatever) and process the request accordingly. We can have endless theoretical discussions on features that are not even implemented yet (e.g., RoCE namespace support) each time we add a minor straightforward, *spec-compliant* change that *all* RoCE vendors adhere to. If someone wishes to introduce a new concept, API refactoring proposal, or similar for community review, please do so with a different RFC. This is hindering progress of the whole RDMA stack development! For example, the posted SoftRoCE patches are waiting just for this. The RoCEv2 patches have been posted upstream for review for months (!) now. I simply cannot understand why this is lagging for so long; let's start to get the wheels rolling. --Liran
RE: [PATCH for-next V2 00/11] Add RoCE v2 support
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- > > Since you and Jason did not reach a consensus, I have to dig in and > > see if these patches make it possible to break namespace confinement, > > either accidentally or with intentionally tricky behavior. That's > > going to take me some time. > > Everything to do with parsing a wc and constructing an AH is wrong in this > series, and the fixes require the API changes I mentioned ( add 'wc to gid > index' API call, add 'route to AH' API call) > > Every time you read 'route validation' - that is an error, the route should > never just be validated, it is needed information to construct a rocev2 AH. > All > the places that roughly hand parse the rocev2 WC should not be open coded. > > Even if current HW is broken for namespaces we should not enshrine that in > the kapi. > Currently, namespaces are not supported for RoCE. So for this patches, this is irrelevant. That said, we have everything we need for RoCE namespace support when we get there. All of this has nothing to do with "broken" and enshrining anything in the kapi. That's just bullshit. The crux of the discussion is the meaning of the API. The design of the RDMA stack is that Verbs are used by core IB services, such as addressing. For these services, as the specification requires, all relevant fields must be reported in the CQE, period. All spec-compliant HW devices follow this. If a ULP wants to create an address handle from a completion, there are service routines to accomplish that, based on the reported fields. If it doesn't care, there is no reason to sacrifice performance. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- > > > You are pushing abstraction into provider code instead of handling it in a > generic way. > > No, I am defining an API that *make sense* and doesn't leak useless details. > Of course that doesn't force code duplication or anyhting like that, just > implement it smartly. > > I think mlx made a big mistake returning network_type instead of gid index, > and I don't want to see that error enshrined in our API. > > > The Verbs are a low-level API, that should report exactly what was > > received from the wire. In the RoCEv2 case, it should be the GID/IP > > addresses and the protocol type. The addressing information is not > > intended to be used directly by applications; it is the raw bits that > > were accepted from the wire. > > Low level details isn't what any in kernel consumer needs. Everything in > kernel needs the gid index to determine the namespace, routing and other > details. It is not optional. A common API is thus needed to do this > conversion. The Verbs are not intended only for kernel consumers, but also for the ib_core, cma, etc. For the ib_core, a provider needs to report *all* relevant information that is not visible in the packet payload. The network type is part of this information. The proposed changes are a straightforward extension to the code base, directly follow the specification, and adhere to the RDMA stack design in which IP addressing is handled by the cma. Also, I don't want to do any route resolution on the Rx path. A UD QP completion just reports the details of the packet it received. Conceptually, an incoming packet may not even match an SGID index at all. Maybe, responses should be sent from another device. This should not be decided that the point that a packet was received. > > API-wise, once you get the gid index then it is trivial to make easy > extractors > for everything else. ie for example: > rdma_get_ud_src_sockaddr(gid_index,,wc,grh) > rdma_get_ud_dst_sockaddr(gid_index,,wc,grh) > Nice ideas, but not relevant to completions. The resolved dst address could point to another port or device completely. The proper way to handle remote UD addresses is by rdma_ids that encapsulate address handles and bound devices. > > ib_init_ah_from_wc() and friends is exactly the place that you want to > > create an address handle based on completion and packet fields. > > CMA needs exactly the same logic as well, the fact it doesn't have it is a bug > in this series. > init_ah_from_wc() should include a route lookup, and this will be fixed. > Jason > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the > body of a message to majord...@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc
> From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] > Look, here is a concrete direction: > > Replace all the crap in > ib_init_ah_from_wc/get_sgid_index_from_eth/rdma_addr_find_dmac_by_ > grh > > with a straightforward > >rdma_dgid_index_from_wc( > const struct ib_qp *qp, > const struct ib_wc *wc, > const struct ib_grh *grh, > u16 *gid_index) > > Sort of function that reads the GRH and wc and returns the unambiguous gid > index that was used to receive that packet on the UD QP. > > The gid cache is not allowed to create an ambiguity the driver cannot resolve. > > That said, I wouldn't object to vendor-specific bits in the wc. Ie if mlx > hardware needs a network_type bit to implement > rdma_find_dgid_index_from_wc, then fine - define a vendor specific place > to put it. In this case rdma_find_dgid_index_from_wc would be a driver call > back, which is fine, and what Caitlin was talking about. > > But, it is not part of our verbs API, and I'd *strongly* encourage other > vendors and future hardware to simply return the gid index that the > hardware matched instead of requiring the software to try and guess after > the fact. > You are pushing abstraction into provider code instead of handling it in a generic way. The Verbs are a low-level API, that should report exactly what was received from the wire. In the RoCEv2 case, it should be the GID/IP addresses and the protocol type. The addressing information is not intended to be used directly by applications; it is the raw bits that were accepted from the wire. I don't want to replicate resolution code in every RoCE driver. ib_init_ah_from_wc() and friends is exactly the place that you want to create an address handle based on completion and packet fields. This is what applications will use (and only if they are interested in generating a response; otherwise this overhead can be skipped). As I said earlier, we can make sure that protocol type information doesn't incur any additional overhead. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V1 5/9] IB/core: Add rdma_network_type to wc
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- > > The abstraction at the gid cache is making it too easy to make this mistake. > It > is enabling callers to do direct gid lookups without a route lookup, which is > unconditionally wrong. Every call site into the gid cache I looked at appears > to > have this problem. > > The simplest fix is to have a new gid cache api for rocve2 that somehow > forces/includes the necessary route lookup. The existing API cannot simply > be extended for rocev2. > I think that the GID cache should remain just that: a cache. We shouldn't bloat it. The CMA is the proper place to handle IP resolution. > > roce_gid_mgmt, is the part that populates this "dumb" database. > > IMHO, adding such a "smart" layer to the GID cache is wrong, as this > > should be part of RDMACM which does the translation. No need to get > > the gid cache involved. > > OK. Change the gid cache so only a RDMA CM private API can return > rocev2 gids. > The same cache is also used in IB and thus by other components, so it cannot be a private CM API. RoCE ULPs use the CMA for establishing connections, so route lookups should be done from there. --Liran -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V7 6/6] IB/ucma: HW Device hot-removal support
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] It does if you want a planned 'gental' removal to be possible.. There could be a lot of design options for a 'gentle' removal, such as first sending a 'prepare' event, and only then doing the flow proposed here. I do not want to address this now. When you do a surprise removal, you disconnect the application from both ucma and uverbs device references. In this state, the only thing that the application can do is close all handles. It doesn't matter if you succeed closing a resource on a device that is still accessible, or close an already destroyed resource... Hurm. That makese sense, but that isn't entirely what this patch set does - it is rough when facing the user, but the kernel side is actually quite gental.. Ie the RDMA CM still sends GMPs on this shutdown path.. Looks like the various storage ULP shutdowns are pretty gental too. Right. Kernel removal is always gentle, because kernel ULPs are trusted to close within a finite time. As an example, I think it is important, if you want to hot-unplug a storage controller the admin expects a clean storage shutdown. I think in-kernel storage drivers do that already. IB shouldn't really be much different. If a storage stack has user-space components, then our future 'gentle' closing will take care of that. Anyway, let's get this patch-set done. Still needs to have the locking fixed.. As pointed out (https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg27023.html), this barrier scheme is used in several places including in the RDMA stack. If there is no functional issue with the code, we can discuss how to address this scheme systematically later. --Liran -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V7 6/6] IB/ucma: HW Device hot-removal support
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe I have no real problem with that, it would be nice to have an answer to the uverbs vs ucma removal ordering question and the basic issue of if we even want to do this so async before merging the driver patch that enables it.. I don't think that the order matters. When you do a surprise removal, you disconnect the application from both ucma and uverbs device references. In this state, the only thing that the application can do is close all handles. It doesn't matter if you succeed closing a resource on a device that is still accessible, or close an already destroyed resource... The primary reason for device surprise removal is to recover from device fatal errors or PCI errors. A quick asynchronous device recovery is exactly what we need, as there is nothing that the application can do more at this state. This is the most pressing issue that this patch-set solves. For the administrative removal use case, can consider providing applications a grace period before disconnecting them in a future patch set. Even here, however, I am not sure that if an administrator decided to remove a device while there are still active applications, we should stall his request. Anyway, let's get this patch-set done. The first 4 you already review and acked the fifth is in mlx4_ib and has no comments to fix so far - some months in the list. Personally, I'm not looking much at drivers.. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v1 08/12] IB/cma: Add net_dev and private data checks to RDMA CM
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] After all, it is the payload that designates the entity that you want to establish a connection to, rather than the packet headers, which are just meant to relay the packet to the proper CM No, that isn't right. The IBA uses the GMP's destination first, then serviceID as the demux. Services IDs are not globally unique, they are scoped by the destination. The destination is the physical CA port and kernel CM agent, so I don't think the answer is that clear. Going forward along these lines: - Name space lookup is done based on BTH.pkey, private_data.IP, and optionally GRH.DGID (if present, for extra validation) -- Primary and alternate paths are not considered at all - If P_Key enforcement is set up via cgroups: -- For CM processing, we only check BTH.pkey --- Upon conflict, the packet is dropped -- The primary/alternate path pkeys are not validated by CM, but will be validated during QP modification Does this sound OK? In any case, let's complete the namespace lookup first, and then follow up with a cgroup patchset. The path data is just *routing* it doesn't describe at all the entity we want to talk to, it is only a proposal for how to flow data to it. In any event, both the GMP headers and the path data needs to be checked against the container's pkey list. I don't know why this is so contentions. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v1 08/12] IB/cma: Add net_dev and private data checks to RDMA CM
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] What is really missing here I guess is a mechanism that would enforce containers to only use certain pkeys - perhaps with something like an RDMA cgroup. It could force containers to only use approved pkeys not only with RDMA CM, but through uverbs, and other user-space interfaces. Ideally yes. Without that it just means you can't use pkey meaningfully with containers.. That's why partition enforcement should be separated from namespaces. If you want to restrict a container to a specific set of pkeys, use cgroups. This would apply both to CM MADs and QPs. - In the MAD case, CM MADs would be first matched to a namespace and rdma_id and dropped upon pkey conflict (with either the headers or the payload). - In the QP case, modify_qp() would fail on conflict. Partitioning needs to be enforced also for applications that don't use the CM at all... For namespaces, it seems more natural to lookup the namespace based solely on the CM payload. After all, it is the payload that designates the entity that you want to establish a connection to, rather than the packet headers, which are just meant to relay the packet to the proper CM agent. (Spec-wise, it is even possible to use completely different node to open a connection on behalf of another node.) So, the sole role of pkeys in the context of namespaces is to locate the proper namespace; not for partition isolation. So, as a first step, let's get the namespace lookup in place. Next, add an rdma crgoup, which would control pkeys, as well as resource counts (you don't want one container to rob the whole system from all QPs...). Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing
From: Weiny, Ira [mailto:ira.we...@intel.com] ib_verbs define an *extensive* direct HW access API, which is constantly evolving. This is the problem with verbs... Huh? It is its strength, if you don't break backward compatibility... You cannot describe the intricate object relations and semantics through an API. In addition, you can't abstract anything or fix stuff in SW. The only way to *truly* know what to expect when performing Verbs calls is to check the node type. How can you say this? mthca, mlx4, mlx5, and qib all have different sets of functionality... all with the same node type. OPA has the same set as qib... same node type. Only that qib is IB, which is fully interoperable with mlx* ib_verbs was never only an API. It started as the Linux implementation of the IBTA standard, with guaranteed semantics and wire protocol. Later, the interface was reused to support additional RDMA devices. However, you could *always* check the node type if you wanted to, thereby retaining the standard guarantees. Win-win situation... Not true at all. For example, Qib does not support XRC and yet has the same node type as mlx4 (5)... The node type is for guaranteeing semantics and interop for the features that you do implement... This is a very strong property; we should not give up on it. On the contrary the property is weak and implies functionality or lack of functionality rather than being explicit. This was done because getting changes to kernel ABIs was hard and we took a shortcut with node type which we should not have. OPA attempts to stop this madness and supports the functionality of verbs _As_ _Defined_ rather than creating yet another set of things which applications need to check against. I totally agree that we should improve the expressiveness and accuracy of our capabilities; you don't need OPA for this. Unfortunately, it is not always the case. Also, there are behaviors that are not defined by the API, but still rely on the node type. Management applications , for example. --Liran
RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing
From: Doug Ledford [mailto:dledf...@redhat.com] But the node_type stands for more than just an abstract RDMA device: In IB, it designates an instance of an industry-standard, well-defined, device type: it's possible link types, transport, semantics, management, everything. It *should* be exposed to user-space so apps that know and care what they are running on could continue to work. I'm sorry, but your argument here is not very convincing at all. And it's somewhat hypocritical. When RoCE was first introduced, the *exact* same argument could be used to argue for why RoCE should require a new node_type. Except then, because RoCE was your own, you argued for, and got, an expansion of the IB node_type definition that now included a relevant link_layer attribute that apps never needed to care about before. However, now you are a victim of your own success. You set the standard then that if the new device can properly emulate an IB Verbs/IB Link Layer device in terms of A) supported primitives (iWARP and usNIC both fail here, and hence why they have their own node_types) and B) queue pair creation process modulo link layer specific addressing attributes, then that device qualifies to use the IB_CA node_type and merely needs only a link_layer attribute to differentiate it. No. RoCE is as an open standard from the IBTA with the exact same RDMA protocol semantics as InfiniBand and a clear set of compliancy rules without which an implementation can't claim to be such. A RoCE device *is* an IB CA with an Ethernet link. In contrast, OPA is a proprietary protocol. We don't know what primitives are supported, and whether the semantics of supported primitives are the same as in InfiniBand. The new OPA stuff appears to be following *exactly* the same development model/path that RoCE did. When RoCE was introduced, all the apps that really cared about low level addressing on the link layer had to be modified to encompass the new link type. This is simply link_layer number three for apps to care about. You are missing my point. API transparency is not a synonym for full semantic equivalence. The Node Type doesn’t indicate level of adherence to an API. Node Type indicates compliancy to a specification (e.g. wire protocol, remote order of execution, error semantics, architectural limitations, etc). The IBTA CA and Switch Node Types belong to devices that are compliant to the corresponding specifications from the InfiniBand Trade Association. And that doesn’t prevent applications to choose to be coded to run over nodes of different Node Type as it happens today with IB/RoCE and iWARP. This has nothing to do with addressing.
RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing
From: Doug Ledford [mailto:dledf...@redhat.com] OPA cannot impersonate IB; OPA node and link types have to be designated as such. In terms of MAD processing flows, both explicit (as in the handle_opa_smi() call below) and implicit code paths (which share IB flows - there are several cases) must make this distinction. As far as in the kernel is concerned, the individual capability bits are much more important. I would actually like to do away with the node_type variable from struct ib_device eventually. As for user space, We agreed on the concept of capability bits for the sake of simplifying code sharing. That is OK. But the node_type stands for more than just an abstract RDMA device: In IB, it designates an instance of an industry-standard, well-defined, device type: it's possible link types, transport, semantics, management, everything. It *should* be exposed to user-space so apps that know and care what they are running on could continue to work. The place for abstraction is in the rdmacm/CMA, which serves applications that just want some RDMA functionality regardless of the underlying technology. All SMI code has different behavior if it is running on a switch or HCA, so testing for 'switchyness' is very appropriate here. Sure... cap_is_switch_smi would be a nice refinement to let us drop nodetype. Exactly, we need a bit added to the immutable data bits, and a new cap_ helper, and then nodetype is ready to be retired. Add a bit, drop a u8 ;-) This is indeed a viable solution. I don't have a problem with sharing the IBA constant names for MAD structures (like RDMA_NODE_IB_SWITCH) between IB and OPA code. They already share the structure layouts/etc. The node type is reflected to user-space, which, as I mentioned above, is important. Abusing this enumeration is misleading, even in the kernel. Jason's proposal for a 'cap_is_switch_smi' is more readable, and directly in line with the explicit capability approach that we discussed.
RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing
From: Ira Weiny ira.we...@intel.com Hi Ira, OPA cannot impersonate IB; OPA node and link types have to be designated as such. In terms of MAD processing flows, both explicit (as in the handle_opa_smi() call below) and implicit code paths (which share IB flows - there are several cases) must make this distinction. +static enum smi_action +handle_opa_smi(struct ib_mad_port_private *port_priv, +struct ib_mad_qp_info *qp_info, +struct ib_wc *wc, +int port_num, +struct ib_mad_private *recv, +struct ib_mad_private *response) +{ ... + } else if (port_priv-device-node_type == RDMA_NODE_IB_SWITCH) --Liran -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing
Why do you have RDMA_NODE_IB_SWITCH related stuff inside the handle_opa_smi() function? Is there a node type of switch in OPA similar to IB? Yes. OPA uses the same node types as IB. Ira No, OPA cannot impersonate IB. It has to have distinct node and link types. --Liran -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
From: Doug Ledford [mailto:dledf...@redhat.com] I suppose that the main issue would be handling existing user memory mappings, which cannot be just invalidated -- the user-space driver may not be aware of the device removal and may access this memory concurrently, and we don't want it to crash. In this case, you are mapping it out of the device BAR space and into a random kernel page, yes? So, if the driver doesn't catch the DRIVER_FATAL event and process that to mean don't bother touching this RDMA device any more, it's going to write to a mailbox that no longer responds and have infinite timeouts, yes? Essentially meaning all mailbox type operations just go into lala land from here on out, right? The kernel activity is asynchronous to user-space. The device may be un-plugged before the user-space driver has a chance to learn that a DEVICE_FATAL event has occurred. In fact, in the current user-space stack design, device drivers don't have a context of their own to read() from file descriptors and rely on the application for that. But even so, you probably don't want a driver to invoke a system call during the fast path just to N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj��!�i
RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
From: Doug Ledford [mailto:dledf...@redhat.com] I suppose that the main issue would be handling existing user memory mappings, which cannot be just invalidated -- the user-space driver may not be aware of the device removal and may access this memory concurrently, and we don't want it to crash. In this case, you are mapping it out of the device BAR space and into a random kernel page, yes? So, if the driver doesn't catch the DRIVER_FATAL event and process that to mean don't bother touching this RDMA device any more, it's going to write to a mailbox that no longer responds and have infinite timeouts, yes? Essentially meaning all mailbox type operations just go into lala land from here on out, right? Pressed 'send' too early... The kernel activity is asynchronous to user-space. The device may be un-plugged before the user-space driver has a chance to learn that a DEVICE_FATAL event has occurred. In fact, in the current user-space stack design, device drivers don't have a context of their own to read() from file descriptors and rely on the application for that. But even so, you probably don't want a driver to invoke a system call during the fast path just to check this condition. For devices that just write the BAR space, an arbitrary kernel page would do. Other devices might wish to first populate this page so that the user-space driver can detect this situation efficiently. --Liran
RE: [PATCH v6 02/26] IB/Verbs: Implement raw management helpers
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- Add raw helpers: rdma_tech_ib rdma_tech_iboe rdma_tech_iwarp rdma_ib_or_iboe (transition, clean up later) To help us detect which technology the port supported. Replace rdma_tech_* with rdma_protocol_*. Cc: Hal Rosenstock h...@dev.mellanox.co.il Cc: Steve Wise sw...@opengridcomputing.com Cc: Tom Talpey t...@talpey.com Cc: Jason Gunthorpe jguntho...@obsidianresearch.com Cc: Doug Ledford dledf...@redhat.com Cc: Ira Weiny ira.we...@intel.com Cc: Sean Hefty sean.he...@intel.com Signed-off-by: Michael Wang yun.w...@profitbricks.com --- include/rdma/ib_verbs.h | 25 + 1 file changed, 25 insertions(+) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index d54f91e..a12e876 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1748,6 +1748,31 @@ int ib_query_port(struct ib_device *device, enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_num); +static inline int rdma_tech_ib(struct ib_device *device, u8 port_num) { + return device-query_transport(device, port_num) + == RDMA_TRANSPORT_IB; +} + +static inline int rdma_tech_iboe(struct ib_device *device, u8 port_num) +{ + return device-query_transport(device, port_num) + == RDMA_TRANSPORT_IBOE; Remove RDMA_TRANSPORT_IBOE. In the current code, the test should be: (IB transport Ethernet link layer). We can later consider each provider declaring the transports directly. +} + +static inline int rdma_tech_iwarp(struct ib_device *device, u8 +port_num) { + return device-query_transport(device, port_num) + == RDMA_TRANSPORT_IWARP; +} + +static inline int rdma_ib_or_iboe(struct ib_device *device, u8 +port_num) { + enum rdma_transport_type tp = device-query_transport(device, +port_num); + + return (tp == RDMA_TRANSPORT_IB || tp == RDMA_TRANSPORT_IBOE); } Remove RDMA_TRANSPORT_IBOE. Just test against RDMA_TRANSPORT_IB. + int ib_query_gid(struct ib_device *device, u8 port_num, int index, union ib_gid *gid); -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v5 00/27] IB/Verbs: IB Management Helpers
From: Hefty, Sean [mailto:sean.he...@intel.com] [snip] So, I think that our old-transport below is just fine. No need to change it (and you aren't, since it is currently implemented as a function). I think there is a need to change this. Encoding the transport into the node type is not a good idea. Having different transport semantics while still returning the same transport for the port is confusing. The only thing which is clear currently is Link Layer. But the use of Link Layer in the code is so convoluted that it is very confusing. I agree. One could implement software iWarp or IBoUDP (RoCEv2) protocols that could run over any link layer and interoperate with existing HW solutions. The stack shouldn't be dealing with the link level at all, with the exception of user space compatibility. Define Transport? There has been a lot of discussion over what a transport is in Verbs. IMO, we should replace using the word 'transport' with just 'rdma_protocol'. And even then I'm not convinced that anything should care, beyond user space compatibility. The caps are what matter. - Sean I completely agree. If we ever see a need for representing a set or subset of cross-layer protocols (at any level, L2-L4, various encapsulations), we will add the proper management helpers. For example: - rdma_protocol_roce() /* both v1 and v2 */ - rdma_protocol_roce_v1() - rdma_protocol_roce_v2() - rdma_protocol_usnic() - rdma_protocol_usnic_udp()
RE: [PATCH v5 00/27] IB/Verbs: IB Management Helpers
From: Michael Wang [mailto:yun.w...@profitbricks.com] [snip] Depends on who is we. For ULPs, you are probably right. However, core services (e.g., mad management, CM, SA) do care about various details. In some cases, where it doesn't matter, this code will use management helpers. In other cases, this code will inspect link, transport, and node attributes of rdma devices. For example, the CM code has specific code paths for IB, RoCE, and iWARP. There is no other CM code; there is no reason to abstract 'CM'. This code will have code paths that depend on various specific details. That's exactly what we want to stop, we have classified the CM to IB and IWARP now :-) We don't want to stop code branches that are not abstractions but rather depend on the specific technology! There is no generic iWARP CM - only one. There is no generic ROCE CM - only one. There is no generic IB CM - only one. At the CM high-level (i.e., whether an ib_dev port registers an IB client), you could consider an rdma_has_cm() call, but this the only place in the code that this check will be called! Hence, no need for a generic check. You want to stop abstract code that uses IB core infrastructure. This new transport is only understand by core-layer currently, for user-layer we still reserve the old transport for them, next step is to use bitmask instead of transport, at that time we can erase the new transport and make the whole stuff used by user-layer only :-) I am not sure that we need a bit mask at all. Your helpers already provide all the useful abstractions, which both core and ULPs call directly. All the information is inferred directly from link, transport, node tuples. Some of the user-space tools need *exactly* the same reasoning. For example, management tools manage specific technologies and protocols, not some abstraction. So, For user-space, we can think about exposing exactly the same helper framework, while providing backward compatibility for the existing interfaces. I'd really like to put the topic on bitmask and user app reform into different thread... bitmask should be next topic, there are many discussion already, but I could imaging far more discussion there, the user reform should be the last step, after every thing in kernel settled down :-) OK Detailed remarks == 1) The introduction of cap_*_*() stuff should have been introduced directly in patch 02/27. This back-and-forth between rdma_ib_or_iboe() and cap_* is confusing and increases the number of patches in the patch-set. Do this and remove patches 16-24. We have some discussion about compress the patch set, merge the reform and introducing patch will mix the concept (like the earlier version), IMHO it will increase the difficulty of review... And now since many review already been done, it's not wise to change the whole structure of patch set IMHO... I think it is because you are conditioning code on one thing, and then conditioning the same code on another thing. This is confusing. Once we get our abstractions correct (i.e., the right helper functions), you replace the existing logic with the suitable helper up-front. We need to classify and integrate the concept into mgmt helper, that would be very helpful for further reform, reform followed by integration sounds not that bad, correct? The problem is that it is hard to follow the reasoning for the first use consumer code with the in-complete helper frame-work. 2)The name rdma_tech_* is lame. rdma_transport_*(), adhering to the above (*) remark, is much better. For example, both IB and ROCE *do* use the same transport. We have some discussion on that too, use transport means going back... No. The existing notion of transport was correct. It was the node type that wasn't. And in any case the new helpers didn't use it. We need the original meaning of transport - see my response to Ira. I propose replacing rdma_node_get_transport() with the following helpers: - rdma_get_transport() - rdma_is_ib_transport() - rdma_is_iwarp_transport() We can change the name at anytime, tech/transport/protocol/standard, just one patch later can easily change it and start the topic of naming, any of these name will unsatisfied someone AFAIK, I'd like to suggest we consider this as a mark temporarily and focus on the logical issue. Sure. The logical issue is: 1. We need the existing notion of transport, meaning a bunch of L4+headers + semantics presented to apps. 2. We might need an *additional* notion of rdma_protocol, which designates a complete wire-format: L2-L4+ including. This could be later a bitmask, a management helper, whatever. Currently, I don't see anything in the existing code that would call such helpers. - ... 3) The name cap_* as it is used above is not accurate. You use it to
RE: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- [snip] a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 65994a1..d54f91e 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -75,10 +75,13 @@ enum rdma_node_type { }; enum rdma_transport_type { + /* legacy for users */ RDMA_TRANSPORT_IB, RDMA_TRANSPORT_IWARP, RDMA_TRANSPORT_USNIC, - RDMA_TRANSPORT_USNIC_UDP + RDMA_TRANSPORT_USNIC_UDP, + /* new transport */ + RDMA_TRANSPORT_IBOE, Remove RDMA_TRANSPORT_IBOE - it is not a transport. ROCE uses IBTA transport. If any code should test for ROCE should invoke a specific helper, e.g., rdma_protocol_iboe(). This is what you currently call rdma_tech_iboe is patch 02/26. I think that pretty much everybody agrees that rdma_protocol_*() is a better name than rdma_tech_*(), right? So, let's change this. The semantics are: check that a link supports a certain wire protocol, or a set of wire protocols, where 'certain' refers to the specific helper... -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v5 00/27] IB/Verbs: IB Management Helpers
From: Michael Wang [mailto:yun.w...@profitbricks.com] Hi, Liran Thanks for the comment :-) On 04/22/2015 01:36 AM, Liran Liss wrote: [snip] (**) This has been extended to also encode the transport in the current code. At least for user-space visible APIs, we might chose to leave this for backward compatibility, but we can consider cleaning up the kernel code. So, I think that our old-transport below is just fine. No need to change it (and you aren't, since it is currently implemented as a function). The new-transport does not really exist, but is broken into several capability checks of the L4 transport, optionally with conditions on the link type. I would remove the table below and tell what we really want to achieve: == move technology-specific feature-check logic out of the (multiple!) IB code components and various ULPs into per-feature helpers. Our purpose is to help core layer do management more clearly, rather then referring from transport and linklayer. Right. IMHO from management's point of view, what we really care about is whether a particular management required by device or not, rather then the details on transport and link layer. Depends on who is we. For ULPs, you are probably right. However, core services (e.g., mad management, CM, SA) do care about various details. In some cases, where it doesn't matter, this code will use management helpers. In other cases, this code will inspect link, transport, and node attributes of rdma devices. For example, the CM code has specific code paths for IB, RoCE, and iWARP. There is no other CM code; there is no reason to abstract 'CM'. This code will have code paths that depend on various specific details. This new transport is only understand by core-layer currently, for user-layer we still reserve the old transport for them, next step is to use bitmask instead of transport, at that time we can erase the new transport and make the whole stuff used by user-layer only :-) I am not sure that we need a bit mask at all. Your helpers already provide all the useful abstractions, which both core and ULPs call directly. All the information is inferred directly from link, transport, node tuples. Some of the user-space tools need *exactly* the same reasoning. For example, management tools manage specific technologies and protocols, not some abstraction. So, For user-space, we can think about exposing exactly the same helper framework, while providing backward compatibility for the existing interfaces. Detailed remarks == 1) The introduction of cap_*_*() stuff should have been introduced directly in patch 02/27. This back-and-forth between rdma_ib_or_iboe() and cap_* is confusing and increases the number of patches in the patch-set. Do this and remove patches 16-24. We have some discussion about compress the patch set, merge the reform and introducing patch will mix the concept (like the earlier version), IMHO it will increase the difficulty of review... And now since many review already been done, it's not wise to change the whole structure of patch set IMHO... I think it is because you are conditioning code on one thing, and then conditioning the same code on another thing. This is confusing. Once we get our abstractions correct (i.e., the right helper functions), you replace the existing logic with the suitable helper up-front. 2)The name rdma_tech_* is lame. rdma_transport_*(), adhering to the above (*) remark, is much better. For example, both IB and ROCE *do* use the same transport. We have some discussion on that too, use transport means going back... No. The existing notion of transport was correct. It was the node type that wasn't. And in any case the new helpers didn't use it. We need the original meaning of transport - see my response to Ira. I propose replacing rdma_node_get_transport() with the following helpers: - rdma_get_transport() - rdma_is_ib_transport() - rdma_is_iwarp_transport() - ... 3) The name cap_* as it is used above is not accurate. You use it to describe technology characteristics rather than extendable capabilities. I would suggest having a single convention for all helpers, such as rdma_has_*() and rdma_is_*(). For example: cap_ib_smi() == rdma_has_smi(). That means going back too... See response to Ira (https://lkml.org/lkml/2015/4/21/951). 4) Remove all capabilities that do not introduce any distinction in the current code. We can add them as needed later. This means remove patches: - [PATCH v5 22/27] IB/Verbs: Use management helper cap_ipoib() – all IB devices support ipoib - [PATCH v5 24/27] IB/Verbs: Use management helper cap_af_ib() – all IB devices support AF_IB. On the other hand: - rdma_has_multicast() makes sense, since iWARP doesn’t support it. - cap_ib_sa() might make sense to cut code even further in the CMA, since RoCE has a GSI but no SA. We
RE: [PATCH v5 00/27] IB/Verbs: IB Management Helpers
Hi Michael, The spirit of this patch-set is great, but I think that we need to clarify some concepts. Since this will affect the whole patch-set, I am laying out my concerns here instead. A suggestion for the resulting management helpers is given below. I believe the result would be much more coherent. --Liran In general An ib_dev (or a port of) should be distinguished by 3 qualifiers: - The link layer: -- Ethernet (shared by iWARP, USNIC, and ROCE) -- Infiniband - The transport (*) -- IBTA transport (shared by IB and ROCE) -- iWARP transport -- USNIC transport (*) Transport means both: - The L4 wire protocols (e.g., BTH+ headers of IBTA, optionally encapsulated by UDP in ROCEv2, or the iWARP stack) - The transport semantics (for example, there are slight semantic differences between IBTA and iWARP) - The node type (**) -- CA -- Switch -- Router (**) This has been extended to also encode the transport in the current code. At least for user-space visible APIs, we might chose to leave this for backward compatibility, but we can consider cleaning up the kernel code. So, I think that our old-transport below is just fine. No need to change it (and you aren't, since it is currently implemented as a function). The new-transport does not really exist, but is broken into several capability checks of the L4 transport, optionally with conditions on the link type. I would remove the table below and tell what we really want to achieve: == move technology-specific feature-check logic out of the (multiple!) IB code components and various ULPs into per-feature helpers. Detailed remarks == 1) The introduction of cap_*_*() stuff should have been introduced directly in patch 02/27. This back-and-forth between rdma_ib_or_iboe() and cap_* is confusing and increases the number of patches in the patch-set. Do this and remove patches 16-24. 2)The name rdma_tech_* is lame. rdma_transport_*(), adhering to the above (*) remark, is much better. For example, both IB and ROCE *do* use the same transport. 3) The name cap_* as it is used above is not accurate. You use it to describe technology characteristics rather than extendable capabilities. I would suggest having a single convention for all helpers, such as rdma_has_*() and rdma_is_*(). For example: cap_ib_smi() == rdma_has_smi(). 4) Remove all capabilities that do not introduce any distinction in the current code. We can add them as needed later. This means remove patches: - [PATCH v5 22/27] IB/Verbs: Use management helper cap_ipoib() – all IB devices support ipoib - [PATCH v5 24/27] IB/Verbs: Use management helper cap_af_ib() – all IB devices support AF_IB. On the other hand: - rdma_has_multicast() makes sense, since iWARP doesn’t support it. - cap_ib_sa() might make sense to cut code even further in the CMA, since RoCE has a GSI but no SA. 5) Do no modify phys_state_show() in [PATCH v5 09/27] IB/Verbs: Reform IB-core verbs/uverbs_cmd/sysfs It *is* the link layer! 6) Remove cap_read_multi_sge It is not device/port feature, but a transport capability. Use rdma_is_iwarp_transport() instead, or introduce a new transport flag in 'enum ib_device_cap_flags'. 7) Remove [PATCH v5 25/27] IB/Verbs: Use management helper cap_eth_ah(). Address handles that refer to Ethernet links always have Ethernet addressing. In the CMA code, using rdma_tech_iboe() is just fine. This is how you define cap_eth_ah() anyway. Currently, this patch just adds clutter. 8) Remove patch [PATCH v5 26/27] IB/Verbs: Clean up rdma_ib_or_iboe(). We do need a transport qualifier, as exemplified in comment 5) above, and for a complete clean model. This is after renaming the function to rdma_is_ib_transport()... Putting it all together == We are left with the following helpers: - rdma_is_ib_transport() - rdma_is_iwarp_transport() - rdma_is_usnic_transport() - rdma_is_iboe() - rdma_has_mad() - rdma_has_smi() - rdma_has_gsi() - complements smi; can be used by the mad code for clarity - rdma_has_sa() - rdma_has_cm() - rdma_has_mcast() Subject: [PATCH v5 00/27] IB/Verbs: IB Management Helpers Since v4: * Thanks for the comments from Hal, Sean, Tom, Or Gerlitz, Jason, Roland, Ira and Steve :-) Please remind me if anything missed :-P * Fix logical issue inside 3#, 14# * Refine 3#, 4#, 5# with label 'free' * Rework 10# to stop using port 1 when port already assigned There are plenty of lengthy code to check the transport type of IB device, or the link layer type of it's port, but actually we are just speculating whether a particular management/feature is supported by the device/port. Thus instead of inferring, we should have our own mechanism for IB management capability/protocol/feature checking, several proposals below. This patch set will reform the method of getting transport type, we will now using query_transport() instead of inferring from transport and link layer respectively, also we defined the
RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
RoCE v2 is really Infiniband over UDP over IP. Why don't we just call it IBoUDP like it is? RoCEv2 is the name in the IBTA spec (Annex 17) We call RoCE IBoE in the kernel, because that's what it is. RoCE is an IBTA marketing name. Looking through the Annex, I don't see where Ethernet is even a requirement for this technology to work. The IB transport is layered over a standard UDP header. I do see where the spec calls out updating the IP header, but that's it. Regardless of what it's called, it replaces the underlying network and transport protocols, versus IB-classic or IBoE/RoCE. That should be captured properly, not by saying there's a new GID type. RoCE v2 doesn't even use GIDs as part of its protocol. It uses UDP/IP addresses. The RoCE Verbs interface references the HCA GID table in QPs and AHs, for all RoCE versions. The specification mandates that the Verbs consumer may use the following protocols over the same RoCE-capable HCA and the same physical port: - RoCEv1 (L2, IB GRH, IB BTH+, payload) - RoCEv2 using IPv4 (L2, IPv4, UDP, IB BTH+, payload) - RoCEv2 using IPv6 (L2, IPv6, UDP, IB BTH+, payload) The distinction (by spec) is done by associating a GID type attribute to each GID table entry, which is either IB, IPv4, or IPv6. This is how apps can create different RC QPs using these different wire protocols, or a single UD QP that can send packets to all of these wire protocols. Perhaps we could add another enum entry for RoCEv1 in the patch to make this clearer. IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that. This change should be exposed properly and not as just a new GID type. I don't understand what do you suggest here. Can you give an example? I don't have a solution here. Please look at Michael Wang's patch series and see how this would fit into that model. The introduction of iWarp required defining a new 'transport' type. IBoE added a new link layer. Based on those changes, this would warrant introducing a new network layer, so that it can be distinguished properly from the other options. Maybe that's the right approach? The new-transport in Michael's patches doesn't refer to the network transport layer, but rather acts as a summary of the tuple link, transport, node_type (the network layer is indeed skipped). The network transport layer of Infiniband and *all* RoCE types is the same: IB transport. As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot be a port attribute because RoCE HCAs support all of them over the same physical port. Maybe we should change these patches to encode the port summary as a bit mask, and provide convenient masks for queries? Alternatively, we could leave the port qualifiers as they are (i.e., with distinct link, transport, node_type qualifiers) instead of introducing new-transport, but provide convenient wrappers for ULPs to use. For example: - is_roce() /* returns true for all RoCE wire protocols */ - is_rocev1() - is_rocev2() - is_iwarp() - ... Cisco's NIC reports a transport layer of 'USNIC_UDP', which should really just be 'UDP'. That NIC supports UDP/IP/Ethernet, based on the rdma stack's model. RoCE v2 is also UDP/IP/Ethernet, it only layers IB over that. (This makes the use of the term 'transport' confusing. Maybe there should also be a 'session' protocol?) It seems completely reasonable that a device which does IB/UDP/IP/Ethernet to someday expose the UDP/IP/Ethernet portion (if it doesn't already), and from the same port at the same time. RoCEv2 uses UDP only as an encapsulation protocol (just like VXLAN). The transport is well-defined: IBTA transport (just like Infiniband). In the USNIC case, UPD is indeed the actual network transport. Rather than continuing to try to make everything look like an IB-classic device because it's convenient, the stack needs to start exposing things properly. I don't know what the right solution should be, but trying to capture this level of detail as a different GID type definitely looks like a step in the wrong direction. As mentioned above, this is how Verbs consumers send traffic using the different wire-protocols over the same physical port. This patch-set follows the specification, and cleanly provides all per-GID associations in 'struct ib_gid_attr'. So basically, in RoCE, a GID entry represents a network interface (a netdev) to the HCA, and encompasses all information related with that interface: MAC, VLAN, MTU, IP address, namespace, etc. This also makes it straight forward to add future extensions. - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
The RoCE Verbs interface references the HCA GID table in QPs and AHs, for all RoCE versions. The IBTA specifically does not define software interfaces. The concern here is the architecture and definition of the linux rdma software stack, not verbs, despite the fact that the layer is named that. Whatever. The ib_* API is the Linux implementation for IBTA Verbs (both Infiniband and RoCE flavors) and iWARP Verbs. This interface uses a GID table. In this patch-set, there is no conceptual change in this regard. It adds GID types, as stated by the RoCE specification, which is a natural extension to the existing code. I am sure that we can consider additional changes in the future, but one patch-set at a time... As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot be a port attribute because RoCE HCAs support all of them over the same physical port. IMO, the architecture needs to change to support multiple network and transport layers over the same link layer. The current restriction of 1 transport per port is too restrictive, and doesn't seem to be true even for current devices, like usnic. usnic uses UDP transport. Infiniband and RoCE use the IB transport (all the BTH+ header stuff as defined by IBTA). iWARP uses IETF defined protocols over TCP, namely RDMAP+DDP+MPA. I am not aware of any device that mixes these transports today. If we ever see such a need in the future, let's talk again. Currently, this is irrelevant. Maybe we should change these patches to encode the port summary as a bit mask, and provide convenient masks for queries? Switching to bit masks is the longer term objective. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Status of ummunot branch?
Here are a few more clarifications: 1) ODP MRs can cover address ranges that do not have a mapping at registration time. This means that MPI can register in advance, say, the lower GB's of the address space, covering malloc's primary arena. Thus, there is no need to adjust to each increase in sbrk(). Similarly, you can register the stack region up to the maximum size of the stack. The stack can grow and shrink, and ODP will always use the current mapping. 2) Virtual addresses covered by an ODP MR must have a valid mapping when they are is accessed (during send/receive WQE processing or as a target of an RDMA/atomic operation). So, Jeff, the only thing you need to make sure is that you don't free() a buffer that you posted and haven't got a completion yet - but I guess that this is something that you already do... :) For example, in the following scenario: a. reg_mr(first GB of the address space) b. p = malloc() c. post_send(p) d. poll for completion e. free(p) f. p = malloc() g. post_send(p) h. poll for completion i. free(p) (c) may incur a page fault (if not pre-fetched or faulted-in by another thread). (e) happens after the completion, so it is guaranteed that (c), when processed by HW, uses the correct application buffer with the current virt-to-phys mapping (at HW access time) The reallocation may or may not change the virtual-to-physical mappings. The message may or may not be paged out (ODP does not hold a reference on the page). In any case, when (g) is processed, it always uses the current mapping. --Liran -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe Sent: Saturday, June 08, 2013 2:58 AM To: Jeff Squyres (jsquyres) Cc: Haggai Eran; Or Gerlitz; linux-rdma@vger.kernel.org; Shachar Raindel Subject: Re: Status of ummunot branch? On Fri, Jun 07, 2013 at 10:59:43PM +, Jeff Squyres (jsquyres) wrote: I don't think this covers other memory regions, like those added via mmap, right? We talked about this at the MPI Forum this week; it doesn't seem like ODP fixes any MPI problems. ODP without 'register all address space' changes the nature of the problem, and fixes only one problem. You do need to cache registrations, and all the tuning parameters (how much do I cache, how long do I hold it for, etc, etc) all still apply. What goes away (is fixed) is the need for intercepts and the need to purge address space from the cache because the backing registration has become non-coherent/invalid. Registrations are always coherent/valid with ODP. This cache, and the associated optimization problem, can never go away. With a 'register all of memory' semantic the cache can move into the kernel, but the performance implication and overheads are all still present, just migrated. 2. MPI still has to intercept (at least) munmap(). Curious to know what for? If you want to prune registrations (ie to reduce memory footprint), this can be done lazyily at any time (eg in a background thread or something). Read /proc/self/maps and purge all the registrations pointing to unmapped memory. Similar to garbage collection. There is no harm in keeping a registration for a long period, except for the memory footprint in the kernel. 3. Having mmap/malloc/etc. return new memory that may already be registered because of a prior memory registration and subsequent munmap/free/etc. is just plain weird. Worse, if we re-register it, ref counts could go such that the actual registration will never actually expire until the process dies (which could lead to processes with abnormally large memory footprints, because they never actually let go of memory because it's still registered). This is entirely on the registration cache implementation to sort out, there are lots of performance/memory trade offs. It is only weird when you think about it in terms of buffers. memory registration has to do with address space, not buffers. What MPI wants is: 1. verbs for ummunotify-like functionality 2. non-blocking memory registration verbs; poll the cq to know when it has completed To me, ODP with an additional 'register all address space' semantic, plus an asynchronous prefetch does both of these for you. 1. ummunotify functionality and caching is now in the kernel, under ODP. RDMA access to an 'all of memory' registration always does the right thing. 2. asynchronous prefetch (eg as a work request) triggers ODP and kernel actions to ready a subset of memory for RDMA, including all the work that memory registration does today (get_user_pages, COW break, etc) Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to
RE: Status of ummunot branch?
-Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- ow...@vger.kernel.org] On Behalf Of Jeff Squyres (jsquyres) Sent: Monday, June 10, 2013 5:50 PM To: Jason Gunthorpe Cc: Haggai Eran; Or Gerlitz; linux-rdma@vger.kernel.org; Shachar Raindel Subject: Re: Status of ummunot branch? On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: We talked about this at the MPI Forum this week; it doesn't seem like ODP fixes any MPI problems. ODP without 'register all address space' changes the nature of the problem, and fixes only one problem. I agree that pushing all registration issues out of the application and (somewhere) into the verbs stack would be a nice solution. You do need to cache registrations, and all the tuning parameters (how much do I cache, how long do I hold it for, etc, etc) all still apply. What goes away (is fixed) is the need for intercepts and the need to purge address space from the cache because the backing registration has become non-coherent/invalid. Registrations are always coherent/valid with ODP. This cache, and the associated optimization problem, can never go away. With a 'register all of memory' semantic the cache can move into the kernel, but the performance implication and overheads are all still present, just migrated. Good summary; and you corrected some of my mistakes -- thanks. That being said, everyone I've talked to about ODP finds it very, very strange that the kernel would keep memory registrations around for memory that is no longer part of a process. Not only does it lead to the new memory is magically already registered semantic that I find weird, it's just plain *odd* for the kernel to maintain state for something that doesn't exist any more. It feels dirty. Sidenote: I was just informed today that the current way MPI implementations implement registration cache coherence (glibc malloc hooks) has been deprecated and will be removed from glibc (http://sourceware.org/ml/libc-alpha/2011-05/msg00103.html). This really puts on the pressure to find a new / proper solution. What MPI wants is: 1. verbs for ummunotify-like functionality 2. non-blocking memory registration verbs; poll the cq to know when it has completed To me, ODP with an additional 'register all address space' semantic, plus an asynchronous prefetch does both of these for you. 1. ummunotify functionality and caching is now in the kernel, under ODP. RDMA access to an 'all of memory' registration always does the right thing. Register all address space is the moral equivalent of not having userspace registration, so let's talk about it in those terms. Specifically, there's a subtle difference between: a) telling verbs to register (0...2^64) -- Which is weird because it tells verbs to register memory that isn't in my address space Another way to look at it is specify IO access permissions for address space ranges. This could be useful to implement a buffer pool to be used for a specific MR only, yet still map/unmap memory within this pool on the fly to optimize physical memory utilization. In this case, you would provide smaller ranges than 2^64... b) telling verbs that the app doesn't want to handle registration -- How that gets implemented is not important (from userspace's point of view) -- if the kernel chooses to implement that by registering non-existent memory, that's the kernel's problem I guess I'm arguing that registering non-existent memory is not the Right Thing. Regardless of what solution is devised for registered memory management (ummunotify, ODP, or something else), a non-blocking verb for registering memory would still be a Very Useful Thing. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH net-next V0 18/21] mlx4_core: adjust catas operation for SRIOV mode
On Fri, Dec 2, 2011 at 2:19 AM, Yevgeny Petrilin yevge...@mellanox.co.il wrote: When running in SRIOV mode, driver should not automatically start/stop the mlx4_core upon sensing an HCA internal error -- doing this disables/enables sriov, which will cause the hypervisor to hang if there are running VMs with attached VFs. Not sure I understand this -- what happens if the driver doesn't reset the device after a catastrophic error? Surely all the VFs are pretty screwed at that point? Which hypervisor are we talking about here anyway? - R. Indeed, if you don't reset the device after an internal error, the PF/VF might not work properly unless you reload the PF driver. However, invoking the SW reset by the PF (upon detecting an internal error) currently affects the SRIOV capability on the PCI so we cannot do it before disabling SRIOV first. Since the driver is not in charge of passing-through VFs to VMs, it cannot disable SRIOV either so we cannot reset the device while using SRIOV. Note that the single-function behavior is not modified. We intend to post a fix for this in a different patch-set. --Liran -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: When IBoE will be merged to upstream?
Small correction needed regarding the multicast forwarding. Since we are talking about IPv6 multicast groups, which translate to 33:33:xx:xx:xx:xx MAC address, the router listener notification protocol is going to be MLD and not IGMP. Still there are switches which support MLD forwarding to prevent the network flooding. Well as I said the mapping of IBoE MGID to Ethernet address is not specified. However I agree that using the same mapping as IPv6 so we end up with 33:33:... addresses makes sense. Agreed. Yes, you are right that MLD snooping is the mechanism for switches to discover IPv6 multicast group membership. However for the IBoE case there is no requirement that IPv6 multicast group membership corresponds in any way to the IBoE multicast group membership for the interface (and indeed as far as I can tell from the IBoE spec, there is no requirement that any IPv6 interface be configured on an IBoE port). Furthermore, even if an IBoE interface sends MLD messages for a given IPv6 group, there is no requirement that a switch use the membership information for that group to forward multicast packets with a non-IPv6 ethertype. Right. Initially there can be flooding within the VLAN. In the future we can evolve to use a group-membership protocol when customers that care about the efficiency drive their switch vendors to support it. Are there any other issues that you would like us to address before updating the patches? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: When IBoE will be merged to upstream?
The text is saying that the specification does not use any of the LID fields in the verbs interface, that is it. It isn't talking about MAC addresses. Exactly how and where the MAC address comes about was never decided, and at least some participants thought it should be a 1:1 algorithmic mapping from the GID. Ditto for VLANs, how and where the vlan tag comes about is not part of the spec. You are trying to rewrite history. Read the spec, address handles fields are fixed. Not really, this was all discussed on this list before the IBxoE working group was formed, The paragraph above is about the RoCE spec. And *this list* did not write the RoCE spec. it was discussed in the working group, The RoCE spec adopts the verbs defined in the base IB spec and does not add any new input modifiers to the AH verb. You may not agree with it but that does not change the spec. I objected to the draft spec leaving this area absent, even. You should submit a comment on this matter using the IBTA comment tracker database if you intend your concern to be taken into account. The spec doesn't say squat about how MAC and VLAN values get into the AH, True. The spec does not say it because there are no MAC and VLAN input modifiers to the create AH verb. The spec assumes the resolution from the L3 address happens below the channel interface. and you have already heard how my opinion on this subject differs from others. I never attempted to misrepresent your opinion. I am just pointing out what the RoCE spec says. But, even if we do get there some day then we could extend the AH. This is unacceptable - we are not going to add another L3 identifier. It wouldn't be adding another L3 itentifier it would be an L2 next hop MAC address for the router. It would be nice to do this from the start but if growing the AH is really that scary then it should wait until someone figures out how to solve the lossless routing problem on ethernet. Augmenting the AH has a significant cost. There is a tradeoff here between preserving the verbs api vs. dealing with the implementation challenges associated with doing address resolution below the verbs. The RoCE spec deliberately chooses one direction. You seem to favor the other one. But in the interest of progress and since we all seem to agree on the way things work when we use link local GIDs, let us move forward with that approach for now. And we can get back to non local GIDs later. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: When IBoE will be merged to upstream?
But, we can't mandate an overload of the GID in a way that it prevents its use as a true L3 address (eventually routable). Actually I'm beginning to think that the only possible way we can use the GID in IBoE is as a link-local IPv6 addresses containing an Ethernet address. Trying to hide neighbour discovery or ARP below the verbs doesn't seem workable -- being forced to change the locking rules we've had for the past 5+ years about create_ah is just the beginning. We get further problems if a remote address should ever change and I'm probably missing other issues. We believe the problems are workable. But let us stop arguing for a while and make progress with link local addressees since we all seem to agree with that. We can get back to non-local GIDs later So the best solution I can see is to declare that an IBoE GID must be an IPv6 address coming from an EUI-64 Ethernet address for the corresponding port; for MGIDs I guess we use the standard IPv6 mapping to Ethernet address 33:33:xx:xx:xx:xx. I'm not sure how we want to handle IPv4 -- presumably unicast ARP can be done within the RDMA CM, which will then create a DGID with the appropriate Ethernet address. However it's not clear to me whether we need a way to create IPv4 (01:00:5e:xx:xx:xx) multicast addresses. Also, since there is no way to map a link-local IPv6 address to a particular interface, then I guess we need a way to pass in the VLAN tag to be used -- presumably we can steal some other field for the 12 bits. (The fact that the IBoE annex does not mention VLANs or 802.1q a single time is just another thing that shows how rushed and incomplete it is) With all this said, I think it means we do not need to do the mapping from GID to Ethernet address in the kernel for IBoE user verbs, since it is so simple -- we can simply add a fairly trivial helper to libibverbs. - R. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: When IBoE will be merged to upstream?
A quibble about multicast - AFAIK this is unsolved. I think some spec needs to be agreed that documents what sort of multicast snooping operations switches need to do, ie if IGMP joins imply that IBoE traffic for the same DMAC is included in the join, or if IBoE requires a seperate IGMP type process on its own ether-type. That would make it much clearer what to do with MGIDs. It would be quite naïve to require *new* snooping functionality in Eth switches. Some switches will gracefully apply to non-ip traffic the filtering information acquired through IGMP snooping. And some will just flood non-ip MC frames within the corresponding VLAN which is benign (e.g. that is the way FIP works). A cleaner solution would be based on MMRP but that, AFAIK, is not very widely deployed so it is less practical at this stage. I agree -- the current spec is rather broken for multicast. Choosing a different ethertype and then saying that all switches will just flood multicast traffic is half-baked at best. It is a realistic approach. Do you claim that there are switches that will not forward the packets? It would be nice to at least have a plan on how to integrate a non-link local address, if that is ever necessary in future. An extended AH with an additional 48 DMAC field seems reasonable to me? You mean have a next-hop destination + a final destination? Could be done I guess. But I'm not sure how having a routing table where you have to look up 48-bit Ethernet addresses is all that different from just having a standard Ethernet forwarding table. I guess Jason suggests regarding the GID as a true L3 address and using a new added L2 field for the next hop L2 address. I suppose something based on MAC-in-MAC (a la 802.1ah) could be done but to be honest the IBoE spec that the IBTA came up with looks rather broken for routing. Routing is out of the scope of the current RoCE spec. And I do not see how .1ah would be relevant for this purpose. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: When IBoE will be merged to upstream?
...A verbs consumer using a RoCE network relies strictly on so-called Layer 3 addressing (GIDs); layer 2 addresses (e.g. subnet local identifiers) are not passed across the verbs interface... Ah, hmm, well, I was on that list during this time and I don't think this statement means what you are saying it does :) ?? It doesn't get any clearer than this. 'subnet local identidifer' == LID The text is saying that the specification does not use any of the LID fields in the verbs interface, that is it. It isn't talking about MAC addresses. Exactly how and where the MAC address comes about was never decided, and at least some participants thought it should be a 1:1 algorithmic mapping from the GID. Ditto for VLANs, how and where the vlan tag comes about is not part of the spec. You are trying to rewrite history. Read the spec, address handles fields are fixed. Good idea! This is exactly what we do today for addresses that the user explicitly declares as link-local addresses. But, we can't mandate an overload of the GID in a way that it prevents its use as a true L3 address (eventually routable). We are very unlikely to see routable IBoE, ever.. Says who? But, even if we do get there some day then we could extend the AH. This is unacceptable - we are not going to add another L3 identifier. BTW, I absolutely hate the mixing of 'Sometimes it is a IPv4, sometimes it is a GID, and sometimes it is an IPv6' in the same field. That is just so nasty. The GID is a GID, don't overload it in an ambiguous way to mean 2 other things! A GID is a GID indeed -- in a RoCE environment, it's the layer 3 identifier. All of our intended values are standard ipv6 encapsulations. create_ah does not accept any sort of source address specifier You are wrong -- sgid_index specifies it. So, what do you propose to put in sgid_index? It isn't big enough to store an IPv6 address. You can't exactly number every IP assigned to every ethernet interface. An iboe device is associated with a specific Ethernet interface. Thus, its gid table only needs to map the ip addresses assigned to that interface. The other fields you mention are not a supserset of socket parameters, they are only IPv6 parameters, IPv4 uses a different set. Like what? Jason, bottom line, I think that we both agree that the rdmacm should do the address resolution. The difference is that by having the rdmacm initially only bind to the device and complete the resolution later (by a call from create_ah()), we don't change the user API for *all* gid types. Having addressed your concerns regarding resolution below the Verbs, we continue to believe that this is the best approach. Again, I don't see how what I've outlined changes the API in any way. We currently support link-local gids, but the architecture must not limit the scope. Doing two routing lookups for the same connection is bad design, it is racey. L2 parameters have to flow from the first routing lookup in RDMA-CM to everything else. So is caching L3--L2 mappings that change a second later... So what? Liran, I don't think you have at all come close to addressing my concerns, you still haven't explained how a full route lookup is even possible in create_ah, for instance. Let alone my other concerns! Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: When IBoE will be merged to upstream?
This discussion has derived into whether we need to expose Eth L2 params across the Verbs interface. This point has been discussed extensively in the IBTA during the development of the RoCE spec. And the direction chosen by the spec is clear: ...A verbs consumer using a RoCE network relies strictly on so-called Layer 3 addressing (GIDs); layer 2 addresses (e.g. subnet local identifiers) are not passed across the verbs interface... The motivation behind this direction had to do with preserving API transparency for applications. The goal was to allow existing and future applications to run above RoCE and native IB without changes. As opposed to what it may seem at first sight, adding Eth L2 parameters to the address vector, *does not* make RoCE closer to IB. It actually goes the other way around. Here is a quick list of what would have to be changed if we were to include Eth L2 address parameters to the Address Vector and other structures/functions that expose L2 params: Structure changes: - ibv_wc - ibv_ah_attr -- ibv_qp_attr -- rdma_ud_param --- rdma_cm_event - ibv_port_attr Verb API changes: - ibv_poll_cq() - ibv_init_ah_from_wc() - ibv_create_ah() - ibv_query_qp() - ibv_modify_qp() - ibv_query_port() - ibv_attach_mcast() - ibv_detach_mcast() rdmacm API changes: - rdma_post_ud_send() - rdma_get_cm_event() - rdma_ack_cm_event() As a result of this: - Existing IB binaries would cease working over RoCE. - Due to added fields in structures, even just recompiling existing applications from source would be problematic. - To make future applications work on both ib and RoCE transparently, you would need additional wrappers such as init_ah(), copy_ah(), and ah_is_equal(), and never inspect address handle fields directly. So why introduce differences between RoCE and IB (for the Application writers!!) when they *aren't* needed? Using rdmacm won't solve this either (UD traffic). By following the direction set forth by the RoCE spec none of this is required. Existing (rdmacm) application binaries do run over RoCE or IB unchanged. Granted, the RoCE spec approach introduces 2 *implementation* issues that we need to tackle: 1. Address resolution, which is a generic function, should not be a device-specific call. In this matter, we already proposed a solution where resolution is done, as required, in generic functions in the kernel. Specifically, we provide L2 information to user-space drivers via create_ah(), avoiding the need to add a new ABI call altogether, while the resolution would take place in a generic CMA routine. 2. The Kernel currently assumes that create_ah() can execute in atomic context. One option is to distinguish between the create_ah() calls (in the kernel) that are done for iboe, which are very few, and the rest of the calls that are ib-only. There are other approaches to solve this as well. It seems clear that our goal should be to solve these issues inside the kernel, in the cleanest manner as possible, while preserving transparency to the applications. Comments are welcome. Liran -Original Message- From: Or Gerlitz [mailto:ogerl...@voltaire.com] Sent: Wednesday, July 07, 2010 9:00 AM To: Liran Liss Cc: Roland Dreier; Jason Gunthorpe; Hefty, Sean; Aleksey Senin; linux-rdma; mo...@voltaire.com; aleks...@voltaire.com; yift...@voltaire.com; Tziporet Koren; al...@voltaire.com Subject: Re: When IBoE will be merged to upstream? Liran Liss wrote: but keeping ib_create_ah() callable from any context is not a goal by itself. going with your approach, if your proposed design is accepted, I believe that you probably need to patch all the code-chains that makes calls under the current assumption I am looking for constructive ideas for supporting iboe without breaking Verbs/CQE/CM syntax. I don't agree that exposing the Ethernet L2 related information to the caller is breaking something, the converse, it is a required enhancement. I think we need to let resolve through the rdma-cm get to know at the consumer level, what are the source / destination macs, vlan id and vlan priority used by an IBoE QP, in the exact manner all the IB equivalents (src/dst lid, pkey, sl) are resolved by the rdma-cm and exposed to the consmer app for IB QP. Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: When IBoE will be merged to upstream?
VLANs are part of L2 in Ethernet -- when you resolve a destination L3 address to an L2 address, you get the outgoing interface, which also determines the VLAN. I think this approach has an advantage over an RDMA device per VLAN in that you keep the standard OS VLAN management (vconfig). I wouldn't judge the RoCE spec so quickly --- it guarantees that rdma application binaries could run on any network. What do you gain by exposing Eth-specific L2 params in the address handle? --Liran -Original Message- From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] Sent: Thursday, June 24, 2010 11:37 PM To: Liran Liss Cc: Hefty, Sean; Roland Dreier; Aleksey Senin; linux-rdma; mo...@voltaire.com; aleks...@voltaire.com; yift...@voltaire.com; Tziporet Koren; al...@voltaire.com Subject: Re: When IBoE will be merged to upstream? The current behavior of ibv_create_ah() requires that the caller provide the L2, and if needed, L3 addressing. Any translation between the L3 and L2 addressing must be done before the call is made. E.g. ibv_create_ah does not use the GID to query the SA to obtain LIDs. Why doesn't IBoE follow this same model? LL: because of the RoCE spec, which states that only GID addressing is used at the Verbs level. The address handle fields are unchanged, and the L2 fields (e.g., lid) are reserved. Note that in Ethernet, you normally don't specify L2 addresses at the transport level (i.e., sockets). We do not have to lavishly follow the IBTA spec in the Linux implementation, especially if it makes no sense. I think Sean is on the right track here, the AH should take the L2 as input just like for IB, and the resolution is done in librdmacm, or somehow manually. The verbs layer is not really analogous to sockets anyhow, the librdmacm is much closer to a socket like interface, and it having a GID go into rdmacm and a full AH with L2/L3 info come out seems entirely reasonable. BTW, what ever was decided about vlans tagging? Is that part of the AH or do you use seperate RDMA devices per vlan? Seems like a point worth considering now. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: When IBoE will be merged to upstream?
Regarding GID to Eth mappings, we discussed using the standard create_ah() Verb for this. In the kernel, create_ah() will call a generic address resolution function in the cma. The returned information will be copied back to user-space in a device-specific structure (since address handles are device-specific). This eliminates adding a new get_eth_l2_addr() ABI call (device-specific or not). In fact, this approach eliminates adding new ABIs for any kind of address translation... Does this seem reasonable? --Liran -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier Sent: Thursday, June 24, 2010 2:24 AM To: Aleksey Senin Cc: linux-rdma; mo...@voltaire.com; aleks...@voltaire.com; yift...@voltaire.com; Tziporet Koren; al...@voltaire.com Subject: Re: When IBoE will be merged to upstream? This is actually a continue of the RAW_ET() issue. We want to make a submition of the patches to the upstream, but there is not support for IB transport in Ethernet devices, and the mlx4_en drivers version is a bit outdated 1.4.1.1 in upstream and 1.5.1 in the OFED There is also missing VLAN support that already present in the OFED. When do you planning to submit changes from OFED to upstream? - I do not search for more things to merge upstream. I have enough work reviewing things that are sent to me. So I will never look through OFED for changes. - I do not handle the mlx4_en driver. Changes for mlx4_en should go to netdev and Dave Miller. - I will try to get back to the IBoE changes when I have time, and I will admit that my time to spend as RDMA maintainer is nowhere near full time and less than it was in the past. - I did allocate a fair amount of time to spend on IBoE recently but unfortunately the patches were not really in a suitable state to merge, and I exhausted that time slice before we reached the end. When patch sets sit outside of the upstream kernel and are shipped in OFED for months and years, it would probably make upstream merging easier if that time was used to fix the patch set. - Specifically for the IBoE patches, shouldn't someone have realized that having a device-specific interface to do the standard mapping of GID to Ethernet address makes no sense? -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: When IBoE will be merged to upstream?
S.B. --Liran -Original Message- From: Hefty, Sean [mailto:sean.he...@intel.com] Sent: Thursday, June 24, 2010 9:06 PM To: Liran Liss; Roland Dreier; Aleksey Senin Cc: linux-rdma; mo...@voltaire.com; aleks...@voltaire.com; yift...@voltaire.com; Tziporet Koren; al...@voltaire.com Subject: RE: When IBoE will be merged to upstream? Regarding GID to Eth mappings, we discussed using the standard create_ah() Verb for this. In the kernel, create_ah() will call a generic address resolution function in the cma. The returned information will be copied back to user-space in a device- specific structure (since address handles are device-specific). This eliminates adding a new get_eth_l2_addr() ABI call (device-specific or not). In fact, this approach eliminates adding new ABIs for any kind of address translation... Does this seem reasonable? The current behavior of ibv_create_ah() requires that the caller provide the L2, and if needed, L3 addressing. Any translation between the L3 and L2 addressing must be done before the call is made. E.g. ibv_create_ah does not use the GID to query the SA to obtain LIDs. Why doesn't IBoE follow this same model? LL: because of the RoCE spec, which states that only GID addressing is used at the Verbs level. The address handle fields are unchanged, and the L2 fields (e.g., lid) are reserved. Note that in Ethernet, you normally don't specify L2 addresses at the transport level (i.e., sockets). Callers can use some out of band mechanism for the mapping, call rdma_resolve_addr, or use some standard networking routine. LL: this would require changes to the Verbs API. rdmacm programs addresses using user-space Verbs, or even just passes the application just address handle attributes... -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace
If we have a dedicated ABI call for this mapping, then it seems reasonable to have it device independent. However, this mapping is really only used when creating address handles. So, we can base the mapping on the (device specific) create_ah() flow, but provide generic mapping functions for all devices to use (this is kind of what happens now). Also, using create_ah() doesn't introduce an ABI call that is specific to ib--eth mappings. This is similar to how device-specific ib_reg_user_mr() functions call the generic ib_umem_get()... -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier Sent: Thursday, May 13, 2010 10:18 PM To: Sean Hefty Cc: 'Eli Cohen'; Eli Cohen; Linux RDMA list; ewg Subject: Re: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace Basically, what I want to understand is why does this change make sense? @@ -1139,6 +1139,10 @@ struct ib_device { struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); +int(*get_eth_l2_addr)(struct ib_device *device, u8 port, + union ib_gid *dgid, int sgid_idx, + u8 *mac, u16 *vlan_id, u8 *tagged); + Yes, that was pretty much my original question. Why do we have a verb for userspace to call a device-specific method to do the mapping? The layering seems wrong somewhere if we have a generic verb to do this mapping, but then put the mapping in device-specific code. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: RDMAoE verbs questions
Existing apps rely on transport_type == IBV_TRANSPORT_IB to indicate IB management is present. There are many examples of this. The art of API compatability is to not break existing old apps, so you don't get to change the meaning of transport_type == IBV_TRANSPORT_IB to mean 'it is only IB verbs like'. That breaks the API. Adding a new field to port_attr preserves functionality but not compatability. I hope you understand the difference. I understand exactly what you mean, but I want to propose another way of looking at the compatibility issue: IB management is a network service. Just as an administrator might mistakenly try to access an FTP server over a wrong eth interface, an IB admin can mistakenly run an IB management application (or a non-rdmacm app that is incompatible with RDMAoE) on a RDMAoE port. In both cases, the service is unreachable, but otherwise no harm is done. Basically, IB management is supported by RDMAoE --- you use the same verbs to access it but you don't get any answers... (Of course that it terms of implementation, we can choose to drop SMPs or non-CM MADs rather then sending them on the wire.) There is a tradeoff here between transparently supporting existing rdmacm apps versus making sure that non-rdmacm apps that may come across an RDMAoE port do not attempt to use it. Given the fact that rdmacm has become a preferred approach, and that for non-rdmacm apps the worst case effect is that of being unable to create a connection through a port that they were not designed to support to begin with, we prefer the approach that we proposed. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: RDMAoE verbs questions
So? There are substantial semantic differences for *all* non-rdmacm applications. Even common ones like OpenMPI. You propose to ignore them? On the contrary! Any application that *does* care what the link layer is can look up a new field in port_attr (rather than a new node transport type). Applications that don't, both old and new, can continue as normal - no changes to the code are required. RDMAoE *is* IB transport over Ethernet - we don't want different devices with different node types exactly for this reason: applications shouldn't care if they are running over IB or RDMAoE, and shouldn't add another switch statement to support RDMAoE. Nonsense. RDMAoE is no such thing, it is utterly incompatible with the IB management model. It is some new protocol that is only about 90% compatible with IB. You are missing the point here - RDMAoE is 100% compatible with IB at the *transport* level, as reflected by the Verbs. The point that the management model is different is true, but irrelevant. The only transport-related issue that matters is addressing, but for user-space apps, it is completely abstracted by the rdmacm. Non-rdmacm apps fall into 2 main categories: 1. IB management+diagnostics apps - these are irrelevant for an Eth network anyway. 2. High-performance middleware such as MPI and SHMEM - these perform optimizations according to the link protocol anyway. So, all relevant apps will work great with either IB or RDMAoE, in a transparent manner. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: RDMAoE verbs questions
Hi Paul, you are not missing anything - lookback communication will work in RDMAoE just as in IB. --Liran -Original Message- From: Paul Grun [mailto:pg...@systemfabricworks.com] Sent: Wednesday, December 02, 2009 10:55 AM To: 'Or Gerlitz'; Liran Liss Cc: 'Sean Hefty'; 'Jason Gunthorpe'; 'Eli Cohen'; 'Jeff Squyres'; linux-rdma@vger.kernel.org Subject: RE: RDMAoE verbs questions Why do you say that Or? I'm a hardware guy so can't comment on what the s/w supports/prevents, but I can see no reason why loopback wouldn't be supported. In fact, if I were guessing, I would guess that the spec currently under development would support loopback.tt What am I missing? -Paul -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Or Gerlitz Sent: Wednesday, December 02, 2009 12:09 AM To: Liran Liss Cc: Sean Hefty; Jason Gunthorpe; Eli Cohen; Jeff Squyres; linux-rdma@vger.kernel.org Subject: Re: RDMAoE verbs questions Liran Liss wrote: from an rdmacm app's point of view - there is no visible difference between IB and RDMAoE ports: both support the complete set of Verbs, just as any IB transport provider wrong, local (loopback) communication aren't supported with RDMAoE. Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html