Re: [ewg] Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
On Tue, 16 Jun 2009 09:10:39 -0700 Roland Dreier rdre...@cisco.com wrote:
> > Yeah, the notifier code remains untouched as we still do not allow
> > dynamic memory operations _while_ our module is loaded. The patch
> > allows the driver to cope with DMEM operations that happened before
> > the module was loaded, which might result in a non-contiguous memory
> > layout. When the driver registers its global memory region in the
> > system, the memory layout must be considered.
> >
> > We chose the term toleration instead of support to illustrate this.
>
> I see. So things just silently broke in some cases when the driver was
> loaded after operations you didn't tolerate?
>
> Anyway, thanks for the explanation.

Well, things did not break silently. The registration of the MR failed with an error code which was reported to userspace.

Will you push the patch for .31 or .32?

Thanks,
Alex
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] [PATCH 3/9] ib_core: RDMAoE support only QP1
On Mon, Jun 15, 2009 at 02:26:42PM -0400, Hal Rosenstock wrote:
> Should ib_post_send_mad return some error on QP0 sends on RDMAoE ports?

You can't send anything over QP0 because it is not created and so there are no data structs corresponding to it.

> What QP1 sends are allowed?

Basically, all QP1 sends are allowed without any changes - QP1 functions as normal. However, RDMAoE will initially support only the CM. In the future, we can support additional QP1 services.

> Is it only SA sends which are faked?

Yes.

> How are others handled? These questions are important to what happens
> to the IB management/diag tools when invoked on an RDMAoE port. We need
> to be able to handle that scenario cleanly.

You should be able to send MADs over QP1 from the kernel. I have not tested sending MADs from userspace, but I can't think of a reason why this would be a problem.

> Is something similar needed in ib_mad_port_close for handling no QP0 on
> RDMAoE ports in terms of destroy_mad_qp/cleanup_recv_queue calls?

No, because it is handled inside destroy_mad_qp():

	if (!qp_info->qp)
		return;
Re: [ewg] RE: [ofa-general] [PATCH 4/9] ib_core: Add RDMAoE SA support
On Tue, Jun 16, 2009 at 10:27:33AM -0700, Sean Hefty wrote:
> > +
> > +	w->member->multicast.rec.qkey = cpu_to_be32(0xc2c);
>
> How can a user control this? An app needs the same qkey for unicast
> traffic.

In RDMAoE, the qkey has a fixed well-known value, which will be returned both by multicast and path queries.

> > +	atomic_inc(&w->member->refcount);
>
> This needs to be moved below...

I don't follow you - please explain.

> > +static struct ib_sa_multicast *
> > +eth_join_multicast(struct ib_sa_client *client,
> > +		   struct ib_device *device, u8 port_num,
> > +		   struct ib_sa_mcmember_rec *rec,
> > +		   ib_sa_comp_mask comp_mask, gfp_t gfp_mask,
> > +		   int (*callback)(int status,
> > +				   struct ib_sa_multicast *multicast),
> > +		   void *context)
> > +{
> > +	struct mcast_device *dev;
> > +	struct eth_work *w;
> > +	struct mcast_member *member;
> > +	int err;
> > +
> > +	dev = ib_get_client_data(device, &mcast_client);
> > +	if (!dev)
> > +		return ERR_PTR(-ENODEV);
> > +
> > +	member = kzalloc(sizeof *member, gfp_mask);
> > +	if (!member)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	w = kzalloc(sizeof *w, gfp_mask);
> > +	if (!w) {
> > +		err = -ENOMEM;
> > +		goto out1;
> > +	}
> > +	w->member = member;
> > +	w->device = device;
> > +	w->port_num = port_num;
> > +
> > +	member->multicast.context = context;
> > +	member->multicast.callback = callback;
> > +	member->client = client;
> > +	member->multicast.rec.mgid = rec->mgid;
> > +	init_completion(&member->comp);
> > +	atomic_set(&member->refcount, 1);
> > +
> > +	ib_sa_client_get(client);
> > +	INIT_WORK(&w->work, eth_mcast_work_handler);
> > +	queue_work(mcast_wq, &w->work);
> > +
> > +	return &member->multicast;
>
> The user could leave/destroy the multicast join request before the
> queued work item runs. We need to hold an additional reference on the
> member until the work item completes.

Yes, thanks for catching this. I'll fix and resend.

> There's substantial differences in functionality between an IB
> multicast group and the Ethernet group. I would rather see the
> differences hidden by the rdma_cm, than the IB SA module.

This is a question of transparency. The motivation is to enable non-cma apps that use the ib_sa_join_multicast() API to work without changes.
> > diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
> > index 1865049..7bf9b5c 100644
> > --- a/drivers/infiniband/core/sa_query.c
> > +++ b/drivers/infiniband/core/sa_query.c
> > @@ -45,6 +45,7 @@
>
> There's not a userspace interface into the sa_query module. How will
> those apps work, or apps that send MADs on QPs other than QP1?

Currently, these apps won't work. Sending MADs directly on QPs other than QP1 will not work. However, we expect that a userspace interface to the sa_query module will be implemented in the future, which will forward queries to the kernel module. Naturally, most kernel ULPs and user-level apps will use these standard interfaces instead of implementing MAD queries themselves.
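For illustration, the race Sean points out, and the fix of holding an extra reference on behalf of the queued work item, can be sketched in userspace C. This is a minimal sketch under stated assumptions, not the code from the patch: `struct member`, `member_join()`, `member_put()` and `work_handler()` are hypothetical stand-ins, C11 atomics replace the kernel's atomic_t, and the `released` flag exists only so the behaviour can be observed.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mcast_member; not the kernel type. */
struct member {
    atomic_int refcount;
    int *released;          /* demo only: set to 1 when the last ref is dropped */
};

/* Drop one reference; free the object when the last one goes away. */
static void member_put(struct member *m)
{
    if (atomic_fetch_sub(&m->refcount, 1) == 1) {
        *m->released = 1;
        free(m);
    }
}

/* Create the member with two references: one for the caller and one
 * owned by the deferred work item, taken BEFORE the work is queued. */
static struct member *member_join(int *released)
{
    struct member *m = calloc(1, sizeof(*m));

    if (!m)
        return NULL;
    m->released = released;
    atomic_init(&m->refcount, 1);        /* caller's reference */
    atomic_fetch_add(&m->refcount, 1);   /* reference held for the work item */
    /* In the kernel, queue_work() would be called at this point. */
    return m;
}

/* Body of the deferred work: do the join, then drop the work's reference. */
static void work_handler(struct member *m)
{
    /* ... the actual join processing would go here ... */
    member_put(m);
}
```

With this ordering, a caller that leaves the group before the work item has run only drops its own reference; the object stays valid until `work_handler()` drops the second one.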
[ewg] Re: [ofa-general] [PATCH 0/9] RDMAoE - RDMA over Ethernet
On Tue, Jun 16, 2009 at 11:16:28AM -0500, Steve Wise wrote:
> Hey Eli,
>
> Does this series implement UD/multicast support? I didn't see it with
> a quick perusal of the patches.

Hi Steve,

Yes, UD multicast is supported. Note that in this version of the patches I use the broadcast MAC address (all FF's) for all multicast, but this will be changed and GIDs will be mapped to multicast addresses.
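The message does not say how GIDs will be mapped to multicast MAC addresses. As a hedged sketch only, the following assumes the IPv6-over-Ethernet convention (33:33 followed by the low 32 bits of the address, as in RFC 2464) and contrasts it with the all-FF broadcast address used in this version of the patches. The function names `mgid_to_mcast_mac()` and `broadcast_mac()` are hypothetical and do not come from the patch set.

```c
#include <stdint.h>
#include <string.h>

/* Current behaviour described above: every multicast goes to ff:ff:ff:ff:ff:ff. */
static void broadcast_mac(uint8_t mac[6])
{
    memset(mac, 0xff, 6);
}

/* ASSUMED mapping (IPv6 convention): 33:33 + last four bytes of the 128-bit MGID. */
static void mgid_to_mcast_mac(const uint8_t mgid[16], uint8_t mac[6])
{
    mac[0] = 0x33;
    mac[1] = 0x33;
    memcpy(&mac[2], &mgid[12], 4);
}
```

A per-group MAC of this kind would let switches filter multicast traffic instead of flooding it to every port, which is the drawback of the broadcast approach.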
Re: [ewg] [PATCH 3/9] ib_core: RDMAoE support only QP1
On Wed, Jun 17, 2009 at 7:10 AM, Eli Cohen e...@dev.mellanox.co.il wrote:
> On Mon, Jun 15, 2009 at 02:26:42PM -0400, Hal Rosenstock wrote:
> > Should ib_post_send_mad return some error on QP0 sends on RDMAoE ports?
>
> You can't send anything over QP0 because it is not created and so there
> are no data structs corresponding to it.

Yes, I understand that's the intention but I didn't see where a MAD posted to QP0 returns an error. Does that occur? Or is it just silently dropped?

> > What QP1 sends are allowed?
>
> Basically, all QP1 sends are allowed without any changes - QP1 functions
> as normal. However, RDMAoE will initially support only the CM. In the
> future, we can support additional QP1 services.

So what happens with other QP1 sends now? Do they go into hyperspace and then timeout?

> > Is it only SA sends which are faked?
>
> Yes
>
> > How are others handled? These questions are important to what happens
> > to the IB management/diag tools when invoked on an RDMAoE port. We
> > need to be able to handle that scenario cleanly.
>
> You should be able to send MADs over QP1 from the kernel. I did not
> make any tests as for sending MADs from userspace but I can't think of
> a reason why this would be a problem.

There are a set of tools (infiniband-diags and ibutils) which do send MADs from userspace. I'm concerned that if someone tries these, the right thing will happen.

-- Hal

> > Is something similar needed in ib_mad_port_close for handling no QP0
> > on RDMAoE ports in terms of destroy_mad_qp/cleanup_recv_queue calls?
>
> No, because it is handled inside destroy_mad_qp():
>
>	if (!qp_info->qp)
>		return;
Re: [ewg] [PATCH 3/9] ib_core: RDMAoE support only QP1
On Wed, Jun 17, 2009 at 07:14:56AM -0400, Hal Rosenstock wrote:
> On Wed, Jun 17, 2009 at 7:10 AM, Eli Cohen e...@dev.mellanox.co.il wrote:
> > On Mon, Jun 15, 2009 at 02:26:42PM -0400, Hal Rosenstock wrote:
> > > Should ib_post_send_mad return some error on QP0 sends on RDMAoE ports?
> >
> > You can't send anything over QP0 because it is not created and so
> > there are no data structs corresponding to it.
>
> Yes, I understand that's the intention but I didn't see where a MAD
> posted to QP0 returns an error. Does that occur? Or is it just silently
> dropped?

But you don't have a struct ib_qp * for QP0 that you could use to post MADs to QP0...

> > > What QP1 sends are allowed?
> >
> > Basically, all QP1 sends are allowed without any changes - QP1
> > functions as normal. However, RDMAoE will initially support only the
> > CM. In the future, we can support additional QP1 services.
>
> So what happens with other QP1 sends now? Do they go into hyperspace
> and then timeout?

SA joins and SA path queries are terminated at the driver. Otherwise, post sends on QP1 should be sent on the wire.

> There are a set of tools (infiniband-diags and ibutils) which do send
> MADs from userspace. I'm concerned that if someone tries these, the
> right thing will happen.

What exactly do you mean?
Re: [ewg] [PATCH 3/9] ib_core: RDMAoE support only QP1
On Wed, Jun 17, 2009 at 7:57 AM, Eli Cohen e...@dev.mellanox.co.il wrote:
> On Wed, Jun 17, 2009 at 07:14:56AM -0400, Hal Rosenstock wrote:
> > On Wed, Jun 17, 2009 at 7:10 AM, Eli Cohen e...@dev.mellanox.co.il wrote:
> > > On Mon, Jun 15, 2009 at 02:26:42PM -0400, Hal Rosenstock wrote:
> > > > Should ib_post_send_mad return some error on QP0 sends on RDMAoE ports?
> > >
> > > You can't send anything over QP0 because it is not created and so
> > > there are no data structs corresponding to it.
> >
> > Yes, I understand that's the intention but I didn't see where a MAD
> > posted to QP0 returns an error. Does that occur? Or is it just
> > silently dropped?
>
> But you don't have a struct ib_qp * for QP0 that you could use to post
> MADs to QP0...

Understood. Doesn't an error need to be returned for certain cases of invoking ib_post_send_mad on this new port type (at least qp0 and maybe some qp1 things)? Look at user_mad.c:umad_write calling ib_post_send_mad().

> > > > What QP1 sends are allowed?
> > >
> > > Basically, all QP1 sends are allowed without any changes - QP1
> > > functions as normal. However, RDMAoE will initially support only
> > > the CM. In the future, we can support additional QP1 services.
> >
> > So what happens with other QP1 sends now? Do they go into hyperspace
> > and then timeout?
>
> SA joins and SA path queries are terminated at the driver. Otherwise,
> post sends on QP1 should be sent on the wire.

So these would just timeout if there's nothing there to consume them? Seems better to error them out at the sender.

> > There are a set of tools (infiniband-diags and ibutils) which do send
> > MADs from userspace. I'm concerned that if someone tries these, the
> > right thing will happen.
>
> What exactly do you mean?

opensm and any IB diag (relying on QP1 and/or QP0) should error out at the sender. From your patches, I'm also not sure how user space sees this port type (umad needs to know about these port types).
Maybe this needs ioctl exposure (and another change to umad API if it's done this way (rather than some other way which is what I think Sean was getting at in his email on node type)); sigh :-(

-- Hal
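To make the "error out at the sender" suggestion concrete, here is a minimal userspace sketch of the policy being discussed. It is an assumption-laden illustration, not code from the patches or from ib_mad: `route_mad_send()`, `enum port_link` and `enum mad_route` are hypothetical names, and rejecting non-CM, non-SA classes on QP1 is the stricter behaviour Hal proposes rather than what the patches do. The management class values 0x03 (SubnAdm) and 0x07 (CM) are the standard IB ones.

```c
#include <errno.h>
#include <stdint.h>

#define MGMT_CLASS_SUBN_ADM 0x03   /* SA queries and multicast joins */
#define MGMT_CLASS_CM       0x07   /* connection management */

enum port_link { PORT_LINK_IB, PORT_LINK_RDMAOE };   /* hypothetical */
enum mad_route { MAD_TO_WIRE, MAD_TO_DRIVER };       /* hypothetical */

/* Decide what happens to a MAD posted on a special QP (0 or 1).
 * Returns 0 and sets *route on success, or a negative errno so the
 * sender fails immediately instead of waiting for a timeout. */
static int route_mad_send(enum port_link link, unsigned int qpn,
                          uint8_t mgmt_class, enum mad_route *route)
{
    if (link == PORT_LINK_IB) {
        *route = MAD_TO_WIRE;          /* regular IB port: unchanged */
        return 0;
    }
    if (qpn != 1)
        return -EINVAL;                /* no QP0 exists on an RDMAoE port */

    switch (mgmt_class) {
    case MGMT_CLASS_SUBN_ADM:
        *route = MAD_TO_DRIVER;        /* SA requests are terminated locally */
        return 0;
    case MGMT_CLASS_CM:
        *route = MAD_TO_WIRE;          /* CM is the one supported QP1 service */
        return 0;
    default:
        return -EINVAL;                /* assumed policy: reject, don't time out */
    }
}
```

Under such a policy a tool like opensm or an IB diagnostic would see an immediate error from its first send on an RDMAoE port.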
[ewg] RE: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID <--> MAC translations
> Why not just use IP to MAC calls? Or use the MAC as the GUID?

We do use standard OS services to map the IP addresses (that were encoded in the GID) to MACs. GIDs encode IP addresses rather than MACs to enable users to use the node names that they are used to. Specifically, we will feed all IP addresses that were assigned to the Ethernet interface into the corresponding port GID table. This will also enable routing in the future.

The only exception is IPv6 link-local addresses, which already encode the MAC. In this case, a simple algorithmic operation extracts the MAC without requiring ARP, etc.

> Do the GIDs follow the IB GID format?

Yes.
___
general mailing list
gene...@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
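The "simple algorithmic operation" for link-local addresses is not spelled out in the message. A plausible sketch, assuming the interface identifier was built from the MAC using the standard modified EUI-64 rule (ff:fe inserted in the middle and the universal/local bit inverted), looks like this. The function name `ll_gid_to_mac()` is hypothetical and is not the helper from the patch.

```c
#include <stdint.h>
#include <string.h>

/* Extract the MAC from a link-local (fe80::/64) GID whose interface ID is
 * a modified EUI-64. Returns 0 on success, -1 if the GID does not have
 * that form (in which case a neighbour lookup would be needed instead). */
static int ll_gid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
    static const uint8_t ll_prefix[8] = { 0xfe, 0x80, 0, 0, 0, 0, 0, 0 };

    if (memcmp(gid, ll_prefix, sizeof(ll_prefix)) != 0)
        return -1;                      /* not link-local */
    if (gid[11] != 0xff || gid[12] != 0xfe)
        return -1;                      /* interface ID not derived from a MAC */

    mac[0] = gid[8] ^ 0x02;             /* undo the universal/local bit flip */
    mac[1] = gid[9];
    mac[2] = gid[10];
    mac[3] = gid[13];                   /* skip the inserted ff:fe bytes */
    mac[4] = gid[14];
    mac[5] = gid[15];
    return 0;
}
```

For example, fe80::202:c9ff:fe01:203 would yield 00:02:c9:01:02:03 with no ARP or neighbour discovery traffic.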
RE: [ewg] RE: OFED libraries - on the web at last
From: Sasha Khapyorsky [mailto:sashakv...@gmail.com] On Behalf Of Sasha Khapyorsky
Sent: Wednesday, June 17, 2009 11:32 AM
To: Todd Rimmer
Cc: Tziporet Koren; ewg@lists.openfabrics.org
Subject: Re: [ewg] RE: OFED libraries - on the web at last

On 08:14 Wed 17 Jun, Todd Rimmer wrote:
> The issue isn't with the management tree. It's with the mvapich
> included in OFED 1.4.1 and previous versions.

Then it should be fixed.

[Todd Rimmer] Agreed, but customers will need a few releases of notice that they should recompile their apps.

> It links with libibcommon and as a result,

Do you know why exactly was it linked (and how)? I searched over mvapich source and didn't find any references in sources and makefiles. It is only mentioned in spec files. So is there any issue really?

[Todd Rimmer] Yes, the .spec file controls what libraries mvapich uses; that same list of libraries is then part of mpicc and mpif77 and used to link all applications built with mvapich. ALL end user and ISV apps linked with mvapich also link with libibcommon. This means that in moving to OFED 1.5 end users and ISVs would have to recompile all their apps. Such a change should not be taken lightly. Instead we should provide what is needed in OFED 1.5 such that existing app binaries will continue to work (might be as simple as a stubbed library). Then we can issue notice to end users and ISVs that their apps will need to be rebuilt and allow them at least 1 release to accomplish that.

AFAIR libibcommon removal was discussed on the list a couple of times before, during and after OFED-1.4.

[Todd Rimmer] Yes, but unfortunately it didn't come into practice in mvapich in OFED 1.4.1 or previous. ABI compatibility is an important requirement. In this case the ABI is that of applications built with mvapich. Is it ideal? NO. But we can't penalize customers. I suspect a dummy library would satisfy the mvapich need, but that would need to be confirmed (FYI, OFED 1.4.1 libibumad also references ibcommon).
Sasha
[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID <--> MAC translations
> Hum, This is a very tricky subject. Co-mingling the IB GID address
> space and the IPv6 address space like this is not really something that
> was envisioned from the IBA side.

Doesn't the IB spec say that an IB GID *is* an IPv6 address? So in theory it should be OK; however I don't think in practice anyone paid attention to making sure that the IB GID space works as an IPv6 address.

- R.
[ewg] RE: [ofa-general] [PATCH 0/9] RDMAoE - RDMA over Ethernet
Eli Cohen wrote,
> RDMA over Ethernet (RDMAoE) allows running the IB transport protocol
> over Ethernet, providing IB capabilities for Ethernet fabrics. The
> packets are standard Ethernet frames with an Ethertype, an IB GRH,
> unmodified IB transport headers and payload. HCA RDMAoE ports are no
> different than regular IB ports from the RDMA stack perspective.

This seems like a hack to try and shoehorn this in under the IB portion of the stack. It makes more sense for it to be its own transport type, a peer to IB and iWarp.

> IB subnet management and SA services are not required for RDMAoE
> operation; Ethernet management practices are used instead. In Ethernet,
> nodes are commonly referred to by applications by means of an IP
> address. RDMAoE treats IP addresses that were assigned to the
> corresponding Ethernet port as GIDs, and makes use of the IP stack to
> bind a destination address to the corresponding netdevice (just as the
> CMA does today for IB and iWARP) and to obtain its L2 MAC addresses.

Since it does not require SM or SA services, this is another reason to make it its own transport type rather than try to fit it in under the IB core services and then emulate the behaviour that the core mid-layer and ULPs expect for things like SA queries.
[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID <--> MAC translations
On Wed, Jun 17, 2009 at 11:20:26AM -0700, Roland Dreier wrote:
> > Hum, This is a very tricky subject. Co-mingling the IB GID address
> > space and the IPv6 address space like this is not really something
> > that was envisioned from the IBA side.
>
> Doesn't the IB spec say that an IB GID *is* an IPv6 address? So in
> theory it should be OK; however I don't think in practice anyone paid
> attention to making sure that the IB GID space works as an IPv6 address.

It is like an IPv6 address but it was expressly envisioned to be a separate space. The IBA authors copied many of the conventions from IPv6 for numbering this new space, like link local, and multicast prefixes, but it was not intended to be co-mingled.

That said, because of the incredible similarity it could probably be co-mingled, with some research and validation, but IIRC the main reason this wasn't done from the start is that the IETF wasn't interested in supplying a protocol number and the definition work to make IB over IPv6 a standard. So instead we have IB over GRH, which is 99% the same.

So, I didn't look closely enough, but what was the ethertype that is used here in this patch set? Hopefully not IPv6.

Therein is the oddness: if the main IPv6 routing table is used to direct packets that are not labeled with the IPv6 ethertype, that is very confusing. On the other hand, if the RDMA packets are labeled with the IPv6 type then you need IETF to supply a protocol number for this.

In either case, this is a good point: it is difficult to imagine including this work in Linux until either IEEE or IETF supply a number..

Jason
[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID <--> MAC translations
> It is like an IPv6 address but it was expressly envisioned to be a
> separate space. The IBA authors copied many of the conventions from
> IPv6 for numbering this new space, like link local, and multicast
> prefixes, but it was not intended to be co-mingled.

Well (I've quoted this many times): IBA section 4.1:

    A GID is a valid 128-bit IPv6 address (per RFC 2373) with additional
    properties / restrictions defined within IBA...

People often try to claim that this sentence doesn't mean what it very explicitly and clearly says, and certainly I believe that existing practice doesn't comply with the IBA spec, but I don't see how anyone can say that a truly compliant IB GID is not a real IPv6 address.

> So, I didn't look closely enough, but what was the ethertype that is
> used here in this patch set? Hopefully not IPv6.

I don't think it's specified in the code -- presumably in HCA FW. Which is an issue as you say -- do we have an IEEE ethertype yet? And if we don't use the IPv6 ethertype, then is multicast going to work well (if the code is moved away from just broadcasting everything)? I doubt that switch IGMP snooping works well for non-IP ethertypes -- in fact I wonder how well existing switches handle IPv6 multicast ;)

- R.
[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID <--> MAC translations
On Wed, Jun 17, 2009 at 11:38:43AM -0700, Roland Dreier wrote:
> > It is like an IPv6 address but it was expressly envisioned to be a
> > separate space. The IBA authors copied many of the conventions from
> > IPv6 for numbering this new space, like link local, and multicast
> > prefixes, but it was not intended to be co-mingled.
>
> Well (I've quoted this many times): IBA section 4.1:
>
>     A GID is a valid 128-bit IPv6 address (per RFC 2373) with additional
>     properties / restrictions defined within IBA...
>
> People often try to claim that this sentence doesn't mean what it very
> explicitly and clearly says, and certainly I believe that existing
> practice doesn't comply with the IBA spec, but I don't see how anyone
> can say that a truly compliant IB GID is not a real IPv6 address.

Yes, I know.. This is a crummy area in IBA. The counter quote to 4.1 is 5.2.2:

    ''Note however, that IBA does not define a relationship between a
    device GID and IPv6 address (ie there is no defined mapping between
    GID and IPv6 address for an IB device or port)''

Which I take to mean that the GID follows all the conventions of RFC 2373 but is a distinct address family. Fundamentally the question that is unanswered by IBA is whether a GID and an IPv6 address that have the same value represent the same interface and host.

FWIW, I have long sought to keep the GID and IPv6 spaces as aligned and would prefer to be done with this ambiguity and have a 1:1 mapping..

> > So, I didn't look closely enough, but what was the ethertype that is
> > used here in this patch set? Hopefully not IPv6.
>
> I don't think it's specified in the code -- presumably in HCA FW. Which
> is an issue as you say -- do we have an IEEE ethertype yet? And if we
> don't use the IPv6 ethertype, then is multicast going to work well (if
> the code is moved away from just broadcasting everything)? I doubt that
> switch IGMP snooping works well for non-IP ethertypes -- in fact I
> wonder how well existing switches handle IPv6 multicast ;)

Hmm, murky indeed.
Your point about IGMPv6 is well made. The problem is that IB GRHs are not IPv6 headers and have different numerology for the Next Header field. Ie in IPv6 Next Header 0x1B is RFC 908, while in GRH it is a BTH. Labeling GRHs with an IPv6 ethertype is fundamentally wrong. But also highly desirable for things like IGMPv6 snooping, etc.

There are obviously a lot of possible trade-offs here, and it is impossible to judge any proposal until the issue of IBoE or RDMAoE and the intended goal of this thing is laid to rest.

Jason
[ewg] Re: [ofa-general] [PATCH 2/9] ib_core: kernel API for GID <--> MAC translations
> Hmm, murky indeed. Your point about IGMPv6 is well made. The problem is
> that IB GRHs are not IPv6 headers and have different numerology for the
> Next Header field. Ie in IPv6 Next Header 0x1B is RFC 908, while in GRH
> it is a BTH. Labeling GRHs with an IPv6 ethertype is fundamentally
> wrong.

Yes, but the next header is the only issue I know of. Since 0x1b is already assigned as an IPv6 next header protocol, we would have to get a new value assigned. However once a non-conflicting value is chosen, then an IB GRH really is an IPv6 header and in that case I think using the IPv6 ethertype too would make things work much better -- eg IB traffic actually could be forwarded by an IPv6 router with no additional work required.

- R.
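The point that the Next Header value is the only field in dispute can be illustrated by decoding the fixed 40-byte header, whose field layout the GRH shares with IPv6. The following is a minimal userspace sketch; `struct hdr40` and `parse_hdr40()` are hypothetical names, and `GRH_NXTHDR_BTH` simply records the 0x1B value quoted in the thread.

```c
#include <stddef.h>
#include <stdint.h>

#define GRH_NXTHDR_BTH 0x1B   /* in a GRH: an IB Base Transport Header follows */

/* Fields common to an IPv6 header and an IB GRH (hypothetical struct). */
struct hdr40 {
    uint8_t        version;      /* 4 bits  */
    uint8_t        tclass;       /* 8 bits  */
    uint32_t       flow_label;   /* 20 bits */
    uint16_t       payload_len;
    uint8_t        next_hdr;     /* the one field whose numbering differs */
    uint8_t        hop_limit;
    const uint8_t *sgid;         /* 16 bytes, source GID / address */
    const uint8_t *dgid;         /* 16 bytes, destination GID / address */
};

/* Decode the header from network byte order; returns -1 if buf is too short. */
static int parse_hdr40(const uint8_t *buf, size_t len, struct hdr40 *h)
{
    if (len < 40)
        return -1;
    h->version     = buf[0] >> 4;
    h->tclass      = (uint8_t)((buf[0] << 4) | (buf[1] >> 4));
    h->flow_label  = ((uint32_t)(buf[1] & 0x0f) << 16) |
                     ((uint32_t)buf[2] << 8) | buf[3];
    h->payload_len = (uint16_t)((buf[4] << 8) | buf[5]);
    h->next_hdr    = buf[6];
    h->hop_limit   = buf[7];
    h->sgid        = buf + 8;
    h->dgid        = buf + 24;
    return 0;
}
```

A receiver that sees next_hdr == 0x1B cannot tell from the header alone whether a BTH or an RFC 908 payload follows, which is why either a distinct ethertype or a newly assigned next-header value is needed.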
[ewg] [GIT PULL - ofed-1.5] cxgb3 RHEL5.2 backports
Hey Vlad,

Please pull from:

ssh://v...@sofa.openfabrics.org/~swise/scm/ofed_kernel.git ofed_1_5

This adds RHEL5.2 backports for cxgb3.

Thanks,
Steve.

---
commit 9167dcdbc40252562fd826acc0598853456deafc
Author: Steve Wise sw...@opengridcomputing.com
Date: Wed Jun 17 19:16:31 2009 -0500

    cxgb3: RHEL5.2 backports.

    Compiles ok.

    Signed-off-by: Steve Wise sw...@opengridcomputing.com

 .../backport/2.6.18-EL5.2/include/linux/skbuff.h   | 102
 .../cxgb3_0001_backout_multq_netdeviceops.patch    | 228
 .../2.6.18-EL5.2/cxgb3_0002_undo_250.patch         | 164
 .../2.6.18-EL5.2/cxgb3_0004_undo_240.patch         |  86
 ...xgb3_0008_pci_dma_mapping_error_to_2_6_26.patch |  17
 .../backport/2.6.18-EL5.2/cxgb3_0010_napi.patch    | 592
 .../backport/2.6.18-EL5.2/cxgb3_0020_sysfs.patch   | 202
 .../backport/2.6.18-EL5.2/cxgb3_0030_sset.patch    |  34
 .../2.6.18-EL5.2/cxgb3_0100_remove_lro.patch       | 391
 .../2.6.18-EL5.2/cxgb3_0110_provider_sysfs.patch   | 120
 .../2.6.18-EL5.2/cxgb3_0120_fixwarnings.patch      |  39
 11 files changed, 1975 insertions(+), 0 deletions(-)