drivers/staging/rdma/hfi1/sdma.c:740: bad if test ?
Hello there,

drivers/staging/rdma/hfi1/sdma.c:740:17: warning: logical ‘and’ of mutually exclusive tests is always false [-Wlogical-op]

Source code is

    if (count < 64 && count > 32768)
        return SDMA_DESCQ_CNT;

Maybe better code

    if (count < 64 || count > 32768)
        return SDMA_DESCQ_CNT;

Regards

David Binderman
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups
On Fri, Sep 11, 2015 at 05:40:41PM -0400, Doug Ledford wrote:
> Then a simple mcast_expire_task that runs every 10 minutes or so and
> leaves any send-only groups that haven't had a packet timestamp update
> in more than some arbitrary amount of time would be a very simple
> addition to make.

That all makes sense to me and addresses the backlog issue Christoph mentioned.

It would also be fairly simple to do send-only join properly with the above, as we simply switch to polling the SA to detect if the join is good or not, on the same timer that would expire it.

> > If it isn't subscribed to the broadcast MLID, it is violating MUST
> > statements in the RFC...
>
> Yeah, my comment was in reference to whether or not it would receive a
> multicast on broadcast packet and forward it properly.

It would be hugely broken to be sensitive to the MLID when working with multicast routing.

Jason
Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups
On 09/11/2015 04:38 PM, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2015 at 04:09:49PM -0400, Doug Ledford wrote:
>> On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
>>> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>>>> During the recent rework of the mcast handling in ipoib, the join
>>>> task for regular and send-only joins were merged.  In the old code,
>>>> the comments indicated that the ipoib driver didn't send enough
>>>> information to auto-create IB multicast groups when the join was a
>>>> send-only join.  The reality is that the comments said we didn't, but
>>>> we actually did.  Since we merged the two join tasks, we now follow
>>>> the comments and don't auto-create IB multicast groups for an ipoib
>>>> send-only multicast join.  This has been reported to cause problems
>>>> in certain environments that rely on this behavior.  Specifically,
>>>> if you have an IB <-> Ethernet gateway then there is a fundamental
>>>> mismatch between the methodologies used on the two fabrics.  On
>>>> Ethernet, an app need not subscribe to a multicast group, merely
>>>> listen.
>>>
>>> This should probably be clarified. On all IP networks IGMP/MLD is used
>>> to advertise listeners.
>>>
>>> An IB/Eth gateway is a router, and IP routers are expected to process
>>> IGMP - so the gateway certainly can (and maybe must) be copying
>>> groups declared with IGMP from the eth side into listeners on IB MGIDs
>>
>> Obviously, the gateway in question currently is not doing this.
>
> Sure, my remark was to clarify the commit comment so people don't think
> this is OK/expected behavior from a gateway.
>
>> We could drop the queue backlog entirely and just send to broadcast
>> when the multicast group is unsubscribed.
>
> I'm pretty sure that would upset the people who care about this
> stuff.. Steady state operation has to eventually move to the optimal
> MLID.

I didn't mean to imply that we wouldn't attempt to join the group still.
Just that right now the process is this:

  send_mcast
    detect no mcast group
    initiate sendonly join, while queueing packet to be sent
  process join completion
    success - send queued packet (which has been delayed)
    fail - leave packet on queue, set backoff timer, try join again
           after timer expires; if we exceed the number of backoff/join
           attempts, drop packet

The people I was on the phone with indicated that they are seeing some packet loss of mcast packets on these sendonly groups. For one thing, the backlog queue we have while waiting for the join to complete is capped at 3 deep and drops anything beyond that, so it is easy to imagine that even a small burst of packets could cause a drop. But this process also introduces delay on the packets being sent.

My comment above then was to suggest the following change:

  send_mcast
    no mcast group
      initiate sendonly join
      simultaneously send current packet to broadcast group
    mcast group, but group still not live
      send current packet to broadcast group
    mcast group, group is live
      send current packet to mcast group
  process join completion
    success - mark mcast group live, no queue backlog to send
    fail - retry as above

The major change here is that we would never queue the packets any more. If the group is not already successfully joined, then we simply send to the broadcast group instead of queueing. Once the group is live, we start sending to the group. This would eliminate packet delays, assuming that the gateway properly forwards multicast packets received on the broadcast group versus on a specific IB multicast group.

Then a simple mcast_expire_task that runs every 10 minutes or so and leaves any send-only groups that haven't had a packet timestamp update in more than some arbitrary amount of time would be a very simple addition to make.

>> Well, we've already established that the gateway device might well be
>> broken.  That makes one wonder if this will work or if it might be
>> broken too.
> If it isn't subscribed to the broadcast MLID, it is violating MUST
> statements in the RFC...

Yeah, my comment was in reference to whether or not it would receive a multicast on broadcast packet and forward it properly.

>> and so this has been happening since forever in OFED (the above is from
>> 1.5.4.1).
>
> But has this been dropped from the new 3.x series that track
> upstream exactly?

I haven't checked that (I don't have any of the later OFEDs downloaded, and I didn't see a git repo for the 3.x series under Vlad's area on the OpenFabrics git server).

>> our list we add.  Because the sendonly groups are not tracked at the net
>> core level, our only option is to move them all to the remove list and
>> when we get another sendonly packet, rejoin.  Unless we want them to
>> stay around forever.  But since they aren't real send-only joins, they
>> are full joins where we simply ignore the incoming data, leaving them
>> around seems a bad idea.
Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups
On Fri, Sep 11, 2015 at 04:09:49PM -0400, Doug Ledford wrote:
> On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
>> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>>> During the recent rework of the mcast handling in ipoib, the join
>>> task for regular and send-only joins were merged.  In the old code,
>>> the comments indicated that the ipoib driver didn't send enough
>>> information to auto-create IB multicast groups when the join was a
>>> send-only join.  The reality is that the comments said we didn't, but
>>> we actually did.  Since we merged the two join tasks, we now follow
>>> the comments and don't auto-create IB multicast groups for an ipoib
>>> send-only multicast join.  This has been reported to cause problems
>>> in certain environments that rely on this behavior.  Specifically,
>>> if you have an IB <-> Ethernet gateway then there is a fundamental
>>> mismatch between the methodologies used on the two fabrics.  On
>>> Ethernet, an app need not subscribe to a multicast group, merely
>>> listen.
>>
>> This should probably be clarified. On all IP networks IGMP/MLD is used
>> to advertise listeners.
>>
>> An IB/Eth gateway is a router, and IP routers are expected to process
>> IGMP - so the gateway certainly can (and maybe must) be copying
>> groups declared with IGMP from the eth side into listeners on IB MGIDs
>
> Obviously, the gateway in question currently is not doing this.

Sure, my remark was to clarify the commit comment so people don't think this is OK/expected behavior from a gateway.

> We could drop the queue backlog entirely and just send to broadcast
> when the multicast group is unsubscribed.

I'm pretty sure that would upset the people who care about this stuff.. Steady state operation has to eventually move to the optimal MLID.

> Well, we've already established that the gateway device might well be
> broken.  That makes one wonder if this will work or if it might be
> broken too.
If it isn't subscribed to the broadcast MLID, it is violating MUST statements in the RFC...

> and so this has been happening since forever in OFED (the above is from
> 1.5.4.1).

But has this been dropped from the new 3.x series that track upstream exactly?

> our list we add.  Because the sendonly groups are not tracked at the net
> core level, our only option is to move them all to the remove list and
> when we get another sendonly packet, rejoin.  Unless we want them to
> stay around forever.  But since they aren't real send-only joins, they
> are full joins where we simply ignore the incoming data, leaving them
> around seems a bad idea.

It doesn't make any sense to work like that. As is, the send-only side looks pretty messed up to me. It really needs to act like ND, and yah, that is a big change.

Just to be clear, I'm not entirely opposed to an OFED compatibility module option, but let's understand how this is broken, what the fix is we want to see for mainline, and why the OFED 'solution' is not acceptable for mainline.

Jason
Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups
On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>> During the recent rework of the mcast handling in ipoib, the join
>> task for regular and send-only joins were merged.  In the old code,
>> the comments indicated that the ipoib driver didn't send enough
>> information to auto-create IB multicast groups when the join was a
>> send-only join.  The reality is that the comments said we didn't, but
>> we actually did.  Since we merged the two join tasks, we now follow
>> the comments and don't auto-create IB multicast groups for an ipoib
>> send-only multicast join.  This has been reported to cause problems
>> in certain environments that rely on this behavior.  Specifically,
>> if you have an IB <-> Ethernet gateway then there is a fundamental
>> mismatch between the methodologies used on the two fabrics.  On
>> Ethernet, an app need not subscribe to a multicast group, merely
>> listen.
>
> This should probably be clarified. On all IP networks IGMP/MLD is used
> to advertise listeners.
>
> An IB/Eth gateway is a router, and IP routers are expected to process
> IGMP - so the gateway certainly can (and maybe must) be copying
> groups declared with IGMP from the eth side into listeners on IB MGIDs

Obviously, the gateway in question currently is not doing this.

> We may also be making a mistake in IPoIB - if the send side MGID join
> fails, the send should probably go to the broadcast, and the join
> retried from time to time. This would at least let the gateway
> optimize a bit more by only creating groups in active use.

That's not *too* far off from what we do. This change could be made without a huge amount of effort. Right now we queue up the packet and retry the join on a timer. Only after all of the join attempts have failed their timed retries do we dequeue and drop the packets. We could drop the queue backlog entirely and just send to broadcast when the multicast group is unsubscribed.
> That would emulate the ethernet philosophy of degrade to broadcast.

Indeed.

> TBH, fixing broken gateway devices by sending to broadcast appeals to
> me more than making a module option..

Well, we've already established that the gateway device might well be broken. That makes one wonder if this will work or if it might be broken too.

>> on the IB side of the gateway.  There are instances of installations
>> with 100's (maybe 1000's) of multicast groups where static creation
>> of all the groups is not practical that rely upon the send-only
>
> With my last patch the SM now has enough information to auto-create
> the wonky send-only join attempts, if it wanted to. It just needs to fill
> in tclass, so it certainly is possible to address this long term
> without a kernel patch.
>
>> joins creating the IB multicast group in order to function, so to
>> preserve these existing installations, add a module option to the
>> ipoib module to restore the previous behavior.
>
> .. but an option to restore behavior of an older kernel version makes
> sense - did we really have a kernel that completely converted a
> send-only join to full join&create?

I just re-checked to verify this and here's what I found.

Kernels v3.10, v3.19, and v4.0 have this in ipoib_mcast_sendonly_join:

	if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
		ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
		return -EBUSY;
	}

	rec.mgid     = mcast->mcmember.mgid;
	rec.port_gid = priv->local_gid;
	rec.pkey     = cpu_to_be16(priv->pkey);

	comp_mask =
		IB_SA_MCMEMBER_REC_MGID		|
		IB_SA_MCMEMBER_REC_PORT_GID	|
		IB_SA_MCMEMBER_REC_PKEY		|
		IB_SA_MCMEMBER_REC_JOIN_STATE;

	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
					 priv->port, &rec,
					 comp_mask, GFP_ATOMIC,
					 ipoib_mcast_sendonly_join_complete,
					 mcast);

Kernel v4.1 and on no longer have sendonly_join as a separate function. So, to answer your question, no, upstream didn't do this.
This, however, is carried in OFED:

	rec.mgid     = mcast->mcmember.mgid;
	rec.port_gid = priv->local_gid;
	rec.pkey     = cpu_to_be16(priv->pkey);

	comp_mask =
		IB_SA_MCMEMBER_REC_MGID		|
		IB_SA_MCMEMBER_REC_PORT_GID	|
		IB_SA_MCMEMBER_REC_PKEY		|
		IB_SA_MCMEMBER_REC_JOIN_STATE;

	if (priv->broadcast) {
		comp_mask |=
			IB_SA_MCMEMBER_REC_QKEY		|
			IB_SA_MCMEMBER_REC_MTU_SELECTOR	|
			IB_SA_MCMEMBER_REC_MTU		|
			IB_SA_MCM
RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource
> > Trying to limit the number of QPs that an app can allocate,
> > therefore, just limits how much of the address space an app can use.
> > There's no clear link between QP limits and HW resource limits,
> > unless you assume a very specific underlying implementation.
>
> Isn't that the point though? We have several vendors with hardware
> that does impose hard limits on specific resources. There is no way to
> avoid that, and ultimately, those exact HW resources need to be
> limited.

My point is that limiting the number of QPs that an app can allocate doesn't necessarily mean anything. Is allocating 1000 QPs with 1 entry each better or worse than 1 QP with 10,000 entries? Who knows?

> If we want to talk about abstraction, then I'd suggest something very
> general and simple - two limits:
>  '% of the RDMA hardware resource pool' (per device or per ep?)
>  'bytes of kernel memory for RDMA structures' (all devices)

Yes - this makes more sense to me.
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
On Fri, Sep 11, 2015 at 07:22:56PM +, Hefty, Sean wrote:
> Trying to limit the number of QPs that an app can allocate,
> therefore, just limits how much of the address space an app can use.
> There's no clear link between QP limits and HW resource limits,
> unless you assume a very specific underlying implementation.

Isn't that the point though? We have several vendors with hardware that does impose hard limits on specific resources. There is no way to avoid that, and ultimately, those exact HW resources need to be limited.

If we want to talk about abstraction, then I'd suggest something very general and simple - two limits:

 '% of the RDMA hardware resource pool' (per device or per ep?)
 'bytes of kernel memory for RDMA structures' (all devices)

That comfortably covers all the various kinds of hardware we support in a reasonable fashion, unless there really is a reason why we need to constrain exactly and precisely PD/QP/MR/AH (I can't think of one off hand).

The 'RDMA hardware resource pool' is a vendor-driver-device specific thing, with no generic definition beyond something that doesn't fit in the other limit.

Jason
Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups
On 09/11/2015 02:21 PM, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2015 at 09:40:25AM -0500, Christoph Lameter wrote:
>> On Thu, 10 Sep 2015, Doug Ledford wrote:
>>
>>> +	 * 1) ifdown/ifup
>>> +	 * 2) a regular mcast join/leave happens and we run
>>> +	 *    ipoib_mcast_restart_task
>>> +	 * 3) a REREGISTER event comes in from the SM
>>> +	 * 4) any other event that might cause a mcast flush
>>
>> Could we have a timeout and leave the multicast group on process
>> exit?
>
> At a minimum, when the socket that did the send closes the send-only
> could be de-refed..

If we kept a ref count, but we don't. Tracking this is not a small change.

> But really send-only in IB is very similar to neighbour discovery, and
> should work the same, with a timer, etc. In my mind it would be ideal
> to even add them to the ND table, if the core would support that..
>
> Jason

--
Doug Ledford
GPG KeyID: 0E572FDD
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
>> If you're planning on following what the existing memcg did in this
>> area, it's unlikely to go well.  Would you mind sharing what you have
>> in mind in the long term?  Where do you see this going?
>
> At least current thoughts are: a central entity authority monitors fail
> count and the new threshold count.
> Fail count - as with other resources, indicates how many times resource
> failure occurred.
> Threshold count - indicates up to what level this resource usage has
> gone (the application might not be able to poll on thousands of such
> resource entries).
> So based on fail count and threshold count, it can tune it further.

So, regardless of the specific resource in question, implementing adaptive resource distribution requires more than simple thresholds and failcnts. The very minimum would be a way to exert reclaim pressure and then a way to measure how much lack of a given resource is affecting the workload. Maybe it can adaptively lower the limits and then watch how often allocation fails, but that's highly unlikely to be an effective measure, as it can't do anything to hoarders, and the frequency of allocation failure doesn't necessarily correlate with the amount of impact the workload is getting (it's not a measure of usage).

This is what I'm wary about. The kernel-userland interface here is cut pretty low in the stack, leaving most of the arbitration and management logic in the userland, which seems to be what people wanted, and that's fine. But then you're trying to implement an intelligent resource control layer which straddles kernel and userland with those low level primitives, which inevitably would increase the required interface surface, as nobody has enough information.

Just to illustrate the point, please think of the alsa interface. We expose hardware capabilities pretty much as-is, leaving management and multiplexing to userland, and there's nothing wrong with it.
It fits better that way; however, we don't then go try to implement a cgroup controller for PCM channels. To do any high-level resource management, you gotta do it where the said resource is actually managed and arbitrated.

What's the allocation frequency you're expecting? It might be better to just let allocations themselves go through the agent that you're planning. You sure can use cgroup membership to identify who's asking tho.

Given how the whole thing is architected, I'd suggest thinking more about how the whole thing should turn out eventually.

Thanks.

--
tejun
RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource
> So, the existence of resource limitations is fine.  That's what we
> deal with all the time.  The problem usually with this sort of
> interfaces which expose implementation details to users directly is
> that it severely limits engineering maneuvering space.  You usually
> want your users to express their intentions and a mechanism to
> arbitrate resources to satisfy those intentions (and in a way more
> graceful than "we can't, maybe try later?"); otherwise, implementing
> any sort of high level resource distribution scheme becomes painful
> and usually the only thing possible is preventing runaway disasters -
> you don't wanna pin unused resource permanently if there actually is
> contention around it, so usually all you can do with hard limits is
> overcommitting limits so that it at least prevents disasters.

I agree with Tejun that this proposal is at the wrong level of abstraction.

If you look at just trying to limit QPs, it's not clear what that attempts to accomplish. Conceptually, a QP is little more than an addressable endpoint. It may or may not map to HW resources (for Intel NICs it does not). Even when HW resources do back the QP, the hardware is limited by how many QPs can realistically be active at any one time, based on how much caching is available in the NIC.

Trying to limit the number of QPs that an app can allocate, therefore, just limits how much of the address space an app can use. There's no clear link between QP limits and HW resource limits, unless you assume a very specific underlying implementation.

- Sean
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Fri, Sep 11, 2015 at 10:17:42PM +0530, Parav Pandit wrote:
> IO controller and applications are mature in nature.
> When the IO controller throttles the IO, applications are pretty mature:
> if IO takes longer to complete, there is possibly almost no way to
> cancel the system call, or rather the application might not want to
> cancel the IO, at least the non-asynchronous one.

I was more talking about the fact that they allow resources to be consumed when they aren't contended.

> So the application just notices lower performance than in the throttled
> case.
> It's really not possible at the RDMA level to hold up a resource
> creation call for a longer time, because reusing an existing resource
> after a failed attempt is likely to give better performance.
> As Doug explained in his example, many RDMA resources, as they are used
> by applications, are relatively long lived. So holding up resource
> creation while a resource is held by another process will certainly look
> bad on the application performance front, compared to returning failure
> and reusing an existing one once it's available, or once a new one is
> available.

I'm not really sold on the idea that this can be used to implement performance based resource distribution. I'll write more about that on the other subthread.

Thanks.

--
tejun
Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups
On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
> During the recent rework of the mcast handling in ipoib, the join
> task for regular and send-only joins were merged.  In the old code,
> the comments indicated that the ipoib driver didn't send enough
> information to auto-create IB multicast groups when the join was a
> send-only join.  The reality is that the comments said we didn't, but
> we actually did.  Since we merged the two join tasks, we now follow
> the comments and don't auto-create IB multicast groups for an ipoib
> send-only multicast join.  This has been reported to cause problems
> in certain environments that rely on this behavior.  Specifically,
> if you have an IB <-> Ethernet gateway then there is a fundamental
> mismatch between the methodologies used on the two fabrics.  On
> Ethernet, an app need not subscribe to a multicast group, merely
> listen.

This should probably be clarified. On all IP networks IGMP/MLD is used to advertise listeners.

An IB/Eth gateway is a router, and IP routers are expected to process IGMP - so the gateway certainly can (and maybe must) be copying groups declared with IGMP from the eth side into listeners on IB MGIDs.

We may also be making a mistake in IPoIB - if the send side MGID join fails, the send should probably go to the broadcast, and the join retried from time to time. This would at least let the gateway optimize a bit more by only creating groups in active use.

That would emulate the ethernet philosophy of degrade to broadcast.

TBH, fixing broken gateway devices by sending to broadcast appeals to me more than making a module option..

> on the IB side of the gateway.  There are instances of installations
> with 100's (maybe 1000's) of multicast groups where static creation
> of all the groups is not practical that rely upon the send-only

With my last patch the SM now has enough information to auto-create the wonky send-only join attempts, if it wanted to.
It just needs to fill in tclass, so it certainly is possible to address this long term without a kernel patch.

> joins creating the IB multicast group in order to function, so to
> preserve these existing installations, add a module option to the
> ipoib module to restore the previous behavior.

.. but an option to restore behavior of an older kernel version makes sense - did we really have a kernel that completely converted a send-only join to full join&create?

> +	 * An additional problem is that if we auto-create the IB
> +	 * mcast group in response to a send-only action, then we
> +	 * will be the creating entity, but we will not have any
> +	 * mechanism by which we will track when we should leave
> +	 * the group ourselves.  We will occasionally leave and
> +	 * re-join the group when these events occur:

I would drop the language of creating-entity, that isn't something from the IBA. The uncontrolled lifetime of the join is not related to creating or not.

> +	 * 2) a regular mcast join/leave happens and we run
> +	 *    ipoib_mcast_restart_task

Really? All send only mgids are discarded at that point? Ugly.

Jason
Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups
On Fri, Sep 11, 2015 at 09:40:25AM -0500, Christoph Lameter wrote:
> On Thu, 10 Sep 2015, Doug Ledford wrote:
>
>> +	 * 1) ifdown/ifup
>> +	 * 2) a regular mcast join/leave happens and we run
>> +	 *    ipoib_mcast_restart_task
>> +	 * 3) a REREGISTER event comes in from the SM
>> +	 * 4) any other event that might cause a mcast flush
>
> Could we have a timeout and leave the multicast group on process
> exit?

At a minimum, when the socket that did the send closes the send-only could be de-refed..

But really send-only in IB is very similar to neighbour discovery, and should work the same, with a timer, etc. In my mind it would be ideal to even add them to the ND table, if the core would support that..

Jason
[PATCH] IB/ehca: Deprecate driver, move to staging, schedule deletion
The ehca driver is only supported on IBM machines with a custom EBus. As they have opted to build their newer machines using more industry standard technology and haven't really been pushing EBus capable machines for a while, this driver can now safely be moved to the staging area and scheduled for eventual removal. This plan was brought to IBM's attention and received their sign-off.

Cc: al...@linux.vnet.ibm.com
Cc: hngu...@de.ibm.com
Cc: rai...@de.ibm.com
Cc: stefan.rosc...@de.ibm.com
Signed-off-by: Doug Ledford
---
 drivers/infiniband/Kconfig                                          | 1 -
 drivers/infiniband/hw/Makefile                                      | 1 -
 drivers/staging/rdma/Kconfig                                        | 2 ++
 drivers/staging/rdma/Makefile                                       | 1 +
 drivers/{infiniband/hw => staging/rdma}/ehca/Kconfig                | 3 ++-
 drivers/{infiniband/hw => staging/rdma}/ehca/Makefile               | 0
 drivers/staging/rdma/ehca/TODO                                      | 4 ++++
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_av.c              | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_classes.h         | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_classes_pSeries.h | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_cq.c              | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_eq.c              | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_hca.c             | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_irq.c             | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_irq.h             | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_iverbs.h          | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_main.c            | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mcast.c           | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mrmw.c            | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mrmw.h            | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_pd.c              | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_qes.h             | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_qp.c              | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_reqs.c            | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_sqp.c             | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_tools.h           | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_uverbs.c          | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hcp_if.c               | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hcp_if.h               | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hcp_phyp.c             | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hcp_phyp.h             | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hipz_fns.h             | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hipz_fns_core.h        | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hipz_hw.h              | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ipz_pt_fn.c            | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ipz_pt_fn.h            | 0
 36 files changed, 9 insertions(+), 3 deletions(-)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/Kconfig (69%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/Makefile (100%)
 create mode 100644 drivers/staging/rdma/ehca/TODO
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_av.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_classes.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_classes_pSeries.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_cq.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_eq.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_hca.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_irq.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_irq.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_iverbs.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_main.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mcast.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mrmw.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mrmw.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_pd.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_qes.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_qp.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_reqs.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_sqp.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_tools.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_uverbs.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/hcp_if.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/hcp_if.h (100%)
 ren
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
> cpuset is a special case but think of cpu, memory or io controllers.
> Their resource distribution schemes are a lot more developed than
> what's proposed in this patchset and that's a necessity because nobody
> wants to cripple their machines for resource control.

The IO controller and its applications are mature. When the IO
controller throttles IO, there is usually no practical way for the
application to cancel the system call (and it often doesn't want to, at
least for non-asynchronous IO), so the application simply observes lower
performance while throttled.

That model isn't really possible at the RDMA level for RDMA resources.
Holding up a resource creation call for a long time is worse than
failing it, because reusing an existing resource after a failure is
likely to give better performance. As Doug explained in his example,
most RDMA resources, as applications use them, are relatively long
lived. So blocking resource creation while the resource is held by
another process will certainly look worse for application performance
than returning a failure and reusing an existing resource once it is
available, or retrying once a new one becomes available.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
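The fail-and-reuse pattern described above can be sketched with a toy
resource pool in plain C. This is a self-contained model, not real verbs
code — all names (`toy_pool`, `pool_acquire`, etc.) are hypothetical; a
real application would pool `ibv_qp`/`ibv_mr` objects from libibverbs:

```c
#include <stddef.h>

#define POOL_MAX 4

/* Toy resource pool illustrating the fail-and-reuse pattern above:
 * acquisition never blocks; on exhaustion the caller immediately gets
 * NULL back and falls back to an already-held resource. Hypothetical
 * names - not libibverbs or kernel code. */
struct toy_res { int in_use; };

struct toy_pool {
	struct toy_res res[POOL_MAX];
	unsigned int limit;        /* e.g. enforced by a cgroup controller */
};

/* Non-blocking acquire: NULL means "over limit, reuse what you have". */
static struct toy_res *pool_acquire(struct toy_pool *p)
{
	unsigned int i, used = 0;

	for (i = 0; i < POOL_MAX; i++)
		used += p->res[i].in_use;
	if (used >= p->limit)
		return NULL;              /* fail fast instead of sleeping */
	for (i = 0; i < POOL_MAX; i++) {
		if (!p->res[i].in_use) {
			p->res[i].in_use = 1;
			return &p->res[i];
		}
	}
	return NULL;
}

static void pool_release(struct toy_res *r)
{
	r->in_use = 0;
}
```

The key design point is that `pool_acquire()` returns immediately rather
than sleeping until a resource frees up, matching the argument that RDMA
applications prefer an instant failure they can react to.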
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
On Fri, Sep 11, 2015 at 10:04 PM, Tejun Heo wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
>> Resource run away by application can lead to (a) kernel and (b) other
>> applications left out with no resources situation.
>
> Yeap, that this controller would be able to prevent to a reasonable
> extent.
>
>> Both the problems are the target of this patch set by accounting via cgroup.
>>
>> Performance contention can be resolved with higher level user space,
>> which will tune it.
>
> If individual applications are gonna be allowed to do that, what's to
> prevent them from jacking up their limits?

I should have been more explicit. I didn't mean that the application
allocating the resources would also be the one controlling the limits.

> So, I assume you're
> thinking of a central authority overseeing distribution and enforcing
> the policy through cgroups?

Exactly.

>> Threshold and fail counters are on the way in follow on patch.
>
> If you're planning on following what the existing memcg did in this
> area, it's unlikely to go well. Would you mind sharing what you have
> on mind in the long term? Where do you see this going?

At least the current thought is: a central authority monitors the fail
count and the threshold count.

Fail count - as in other controllers, indicates how many times a
resource allocation has failed.
Threshold count - indicates the high-water mark this resource's usage
has reached. (An application might not be able to poll thousands of
such resource entries.)

Based on the fail count and threshold count, the central authority can
tune the limits further.

> Thanks.
>
> --
> tejun
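The fail/threshold counters described above can be modeled with a small
sketch in plain C. The names (`res_acct`, `res_charge`) are hypothetical
stand-ins, not the actual devcg patch code:

```c
#include <stddef.h>

/* Toy per-cgroup accounting for one RDMA resource type, modeling the
 * two counters discussed above. Hypothetical names - a sketch of the
 * proposed behavior, not the actual patchset. */
struct res_acct {
	unsigned int usage;     /* currently allocated                 */
	unsigned int limit;     /* hard limit set by central authority */
	unsigned int fail_cnt;  /* how many allocations have failed    */
	unsigned int threshold; /* high-water mark of usage            */
};

/* Charge one resource; returns 0 on success, -1 when over the limit. */
static int res_charge(struct res_acct *a)
{
	if (a->usage >= a->limit) {
		a->fail_cnt++;               /* central authority polls this... */
		return -1;
	}
	a->usage++;
	if (a->usage > a->threshold)
		a->threshold = a->usage;     /* ...and this high-water mark */
	return 0;
}

static void res_uncharge(struct res_acct *a)
{
	if (a->usage)
		a->usage--;
}
```

A management daemon watching `fail_cnt` rise while `threshold` sits at
the limit would know the limit is too tight, without having to poll
every individual resource.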
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
> Resource run away by application can lead to (a) kernel and (b) other
> applications left out with no resources situation.

Yeap, that's something this controller would be able to prevent to a
reasonable extent.

> Both the problems are the target of this patch set by accounting via cgroup.
>
> Performance contention can be resolved with higher level user space,
> which will tune it.

If individual applications are gonna be allowed to do that, what's to
prevent them from jacking up their limits? So, I assume you're thinking
of a central authority overseeing distribution and enforcing the policy
through cgroups?

> Threshold and fail counters are on the way in follow on patch.

If you're planning on following what the existing memcg did in this
area, it's unlikely to go well. Would you mind sharing what you have in
mind in the long term? Where do you see this going?

Thanks.

--
tejun
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
> If the resource isn't and the main goal is preventing runaway
> hogs, it'll be able to do that but is that the goal here? For this to
> be actually useful for performance contended cases, it'd need higher
> level abstractions.

Resource runaway by an application can lead to (a) the kernel and (b)
other applications being left with no resources. Both problems are the
target of this patch set, via accounting through cgroups.

Performance contention can be resolved by higher-level user space, which
will tune the limits. Threshold and fail counters are on the way in a
follow-on patch.

> Thanks.
>
> --
> tejun
Re: ehca driver status?
On 09/11/2015 04:55 AM, Alexander Schmidt wrote:
> On Wed, 26 Aug 2015 06:46:48 -0700 Christoph Hellwig wrote:
>
>> Hi Hoang-Nam, hi Christoph, hi Alexander,
>>
>> do you know what the current state of the eHCA driver and hardware is?
>>
>> The driver hasn't seen any targeted updates since 2010, so we wonder if
>> it's still alive? It's one of the few drivers not supporting FRWRs,
>> and it's also really odd in that it has its own dma_map_ops
>> implementation that seems to pass virtual addresses to the hardware
>> (or probably rather firmware), getting in the way of at least two
>> currently pending cleanups.
>
> Hi again,
>
> we are okay with removing the driver from the tree, let me know if
> there is anything we should do to help.

I'll get it moved over to staging ASAP so it can be on the same schedule
as the other two drivers we are deprecating. Thanks!

--
Doug Ledford
GPG KeyID: 0E572FDD
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Parav.

On Fri, Sep 11, 2015 at 10:13:59AM +0530, Parav Pandit wrote:
> > My uneducated suspicion is that the abstraction is just not developed
> > enough. It should be possible to virtualize these resources through,
> > most likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and OS schedules the transfer
> > prioritizing competing requests.
>
> Tejun,
> That is such a perfect abstraction to have at OS level, but I am not
> sure how close to bare-metal RDMA it can be.
> I have started a discussion on that front as part of another thread,
> but it's certainly a long way to go.
> Most want to enjoy the performance benefit of the bare-metal
> interfaces it provides.

Yeah, sure, I'm not trying to say that rdma needs or should do that.

> Such an abstraction as you mentioned exists; the only difference is
> that instead of the OS being the central entity, the higher-level
> libraries, drivers and hw together do it today for the applications.

But my point is more that having resource control in the OS and actual
arbitration higher up in the stack isn't likely to lead to an effective
resource distribution scheme.

> > You kinda have to decide that upfront cuz it gets baked into the
> > interface.
>
> Well, all the interfaces are not yet defined.

I meant the cgroup interface.

> Except the test and benchmark utilities, real world applications
> wouldn't really bother much about which device they are going through.

Weights can work fine across multiple devices. Hard limits don't. It
just doesn't make any sense. Unless you can exclude multiple device
scenarios, you'll have to implement per-device limits.

Thanks.

--
tejun
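Tejun's point — weights can be distributed globally, but a hard limit is
only meaningful per device — can be illustrated with a toy limit table
in plain C. All names here are hypothetical, not the actual devcg
interface:

```c
#include <string.h>

#define MAX_DEVS 4

/* Toy per-device hard-limit table, illustrating why one global limit
 * is meaningless when a task can allocate resources on several RDMA
 * devices: exhausting device 1 says nothing about device 0.
 * Hypothetical names - not the proposed cgroup files. */
struct dev_limit {
	unsigned int limit[MAX_DEVS];
	unsigned int usage[MAX_DEVS];
};

/* Charge n resources on device dev; fail only if that device's own
 * limit would be exceeded, independent of usage on other devices. */
static int devlim_charge(struct dev_limit *d, unsigned int dev,
			 unsigned int n)
{
	if (dev >= MAX_DEVS)
		return -1;
	if (d->usage[dev] + n > d->limit[dev])
		return -1;
	d->usage[dev] += n;
	return 0;
}
```

A weight-based scheme could divide one number among children, but with
hard caps each device needs its own `limit[]` entry, which is exactly
the per-device interface Tejun is arguing for.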
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hello, Doug.

On Fri, Sep 11, 2015 at 12:24:33AM -0400, Doug Ledford wrote:
> > My uneducated suspicion is that the abstraction is just not developed
> > enough.
>
> The abstraction is 10+ years old. It has had plenty of time to ferment
> and something better for the specific use case has not emerged.

I think that is likely more reflective of the use cases rather than
anything inherent in the concept.

> > It should be possible to virtualize these resources through,
> > most likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and OS schedules the transfer
> > prioritizing competing requests.
>
> No. And if you think this, then you miss the *entire* point of RDMA
> technologies. An analogy that I have used many times in presentations
> is that, in the networking world, the kernel is both a postman and a
> copy machine. It receives all incoming packets and must sort them to
> the right recipient (the postman job) and when the user space
> application is ready to use the information it must copy it into the
> user's VM space because it couldn't just put the user's data buffer on
> the RX buffer list since each buffer might belong to anyone (the copy
> machine). In the RDMA world, you create a new queue pair, it is often a
> long lived connection (like a socket), but it belongs now to the app and
> the app can directly queue both send and receive buffers to the card and
> on incoming packets the card will be able to know that the packet
> belongs to a specific queue pair and will immediately go to that app's
> buffer. You can *not* do this with TCP without moving to complete TCP
> offload on the card, registration of specific sockets on the card, and
> then allowing the application to pre-register receive buffers for a
> specific socket to the card so that incoming data on the wire can go
> straight to the right place. If you ever get to the point of "OS
> schedules the transfer" then you might as well throw RDMA out the window
> because you have totally trashed the benefit it provides.

I don't know. This sounds like the classic "this is painful so it must
be good" bare metal fantasy. I get that rdma succeeds at bypassing a
lot of overhead. That's great but that really isn't exclusive with
having more accessible mechanisms built on top. The crux of the cost
saving is the hardware knowing where the incoming data belongs and
putting it there directly. Everything else is there to facilitate that
and if you're declaring that it's impossible to build accessible
abstractions for that, I can't agree with you.

Note that this is not to say that rdma should do that in the operating
system. As you said, people have been happy with the bare abstraction
for a long time and, given relatively specialized use cases, that can be
completely fine but please do note that the lack of proper abstraction
isn't an inherent feature. It's just easier that way and putting in
more effort hasn't been necessary.

> > It could be that given the use cases rdma might not need such level of
> > abstraction - e.g. most users want to be and are pretty close to bare
> > metal, but, if that's true, it also kinda is weird to build
> > hierarchical resource distribution scheme on top of such bare
> > abstraction.
>
> Not really. If you are going to have a bare abstraction, this one isn't
> really a bad one. You have devices. On a device, you allocate
> protection domains (PDs). If you don't care about cross connection
> issues, you ignore this and only use one. If you do care, this acts
> like a process's unique VM space only for RDMA buffers, it is a domain
> to protect the data of one connection from another. Then you have queue
> pairs (QPs) which are roughly the equivalent of a socket. Each QP has
> at least one Completion Queue where you get the events that tell you
> things have completed (although they often use two, one for send
> completions and one for receive completions). And then you use some
> number of memory registrations (MRs) and address handles (AHs) depending
> on your usage. Since RDMA stands for Remote Direct Memory Access, as
> you can imagine, giving a remote machine free rein to access all of the
> physical memory in your machine is a security issue. The MRs help to
> control what memory the remote host on a specific QP has access to. The
> AHs control how we actually route packets from ourselves to the remote
> host.
>
> Here's the deal. You might be able to create an abstraction above this
> that hides *some* of this. But it can't hide even nearly all of it
> without losing significant functionality. The problem here is that you
> are thinking about RDMA connections like sockets. They aren't. Not
> even close. They are "how do I allow a remote machine to directly read
> and write into my machine's physical memory in an even remotely close to
> secure manner?" These resources aren't hardware resources, they are the
> abstraction resources neede
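The containment hierarchy Doug describes (device → PD → QP/CQ/MR) can be
modeled with a self-contained toy sketch in plain C. These `toy_*` types
are stand-ins so the example runs anywhere; a real application would use
`struct ibv_pd`, `struct ibv_qp`, etc. and call `ibv_alloc_pd()`,
`ibv_create_qp()` and `ibv_reg_mr()` from libibverbs:

```c
#include <stdlib.h>

/* Toy model of the RDMA resource hierarchy described above:
 * device -> protection domain (PD) -> {QP, MR}. Stand-in types only,
 * not libibverbs. */
struct toy_dev { int nr_pd; };
struct toy_pd  { struct toy_dev *dev; int nr_qp, nr_mr; };
struct toy_qp  { struct toy_pd *pd; };
struct toy_mr  { struct toy_pd *pd; size_t len; };

static struct toy_pd *toy_alloc_pd(struct toy_dev *dev)
{
	struct toy_pd *pd = calloc(1, sizeof(*pd));
	if (pd) { pd->dev = dev; dev->nr_pd++; }
	return pd;
}

/* A QP always lives inside exactly one PD: memory registered under a
 * different PD is invisible to it - that containment is the protection
 * Doug describes. */
static struct toy_qp *toy_create_qp(struct toy_pd *pd)
{
	struct toy_qp *qp = calloc(1, sizeof(*qp));
	if (qp) { qp->pd = pd; pd->nr_qp++; }
	return qp;
}

static struct toy_mr *toy_reg_mr(struct toy_pd *pd, size_t len)
{
	struct toy_mr *mr = calloc(1, sizeof(*mr));
	if (mr) { mr->pd = pd; mr->len = len; pd->nr_mr++; }
	return mr;
}
```

The point of the model is the ownership chain: every QP and MR carries a
back-pointer to its PD, which is why per-PD (and per-device) accounting
of these objects is a natural fit for a cgroup controller.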
Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups
On Thu, 10 Sep 2015, Doug Ledford wrote:

> + * 1) ifdown/ifup
> + * 2) a regular mcast join/leave happens and we run
> + *    ipoib_mcast_restart_task
> + * 3) a REREGISTER event comes in from the SM
> + * 4) any other event that might cause a mcast flush

Could we have a timeout and leave the multicast group on process exit?
The old code base did that with the ipoib_mcast_leave_task() function.
With that timeout we would no longer accumulate send-only MC
subscriptions on long-running systems.

Also, IPOIB_MAX_MCAST_QUEUE's default of 3 is not really enough to
capture a burst of traffic sent to a multicast group. Can we make this
configurable or increase the max?
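The backlog being discussed is a bounded packet queue: while a
(send-only) multicast join is still in flight, outgoing packets are
queued up to IPOIB_MAX_MCAST_QUEUE and everything beyond that is
dropped. A toy model of that drop policy (plain C; the `toy_*` names
are stand-ins, not the actual ipoib code):

```c
#include <string.h>

/* Mirrors the default backlog depth under discussion above. */
#define TOY_MAX_MCAST_QUEUE 3

/* Toy bounded backlog for packets queued while a multicast join is
 * still in flight. Beyond the limit, packets are dropped - which is
 * why a burst larger than 3 loses traffic. Stand-in code, not the
 * actual ipoib implementation. */
struct toy_backlog {
	int pkts[TOY_MAX_MCAST_QUEUE];
	unsigned int len;
	unsigned int dropped;
};

/* Returns 0 if queued, -1 if the packet had to be dropped. */
static int backlog_enqueue(struct toy_backlog *q, int pkt)
{
	if (q->len >= TOY_MAX_MCAST_QUEUE) {
		q->dropped++;
		return -1;
	}
	q->pkts[q->len++] = pkt;
	return 0;
}
```

With a depth of 3, a burst of four packets arriving before the join
completes drops the fourth, which is exactly the concern about making
the maximum configurable.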
Re: [PATCH v1 0/3] libibverbs: On-demand paging support
On 09/11/2015 03:14 AM, Or Gerlitz wrote:
> On 9/10/2015 7:53 PM, Doug Ledford wrote:
>>>> I don't see it in your kernel.org tree [1]
>>
>> Because I didn't push it yet. I can push it now
>>
>> Push complete.
>
> So we should keep chasing every commit you took and have humans look
> for it among bunch of trees/branches,

No. This is libibverbs, not my kernel tree. There is only one place to
look, Or. I hadn't pushed it out live yet even though I had accepted it
into my tree, that's all. The rest of your complaints below have no
basis in reality when it comes to the libibverbs tree.

> it's so tiring and annoying
> with all our (non-human) automation systems that attempt to pick applied
> patches into the internal review, build and staging systems completely
> non-functioning for a couple of months.
>
> Or.

--
Doug Ledford
GPG KeyID: 0E572FDD
Re: ehca driver status?
On Wed, 26 Aug 2015 06:46:48 -0700 Christoph Hellwig wrote:

> Hi Hoang-Nam, hi Christoph, hi Alexander,
>
> do you know what the current state of the eHCA driver and hardware is?
>
> The driver hasn't seen any targeted updates since 2010, so we wonder if
> it's still alive? It's one of the few drivers not supporting FRWRs,
> and it's also really odd in that it has its own dma_map_ops
> implementation that seems to pass virtual addresses to the hardware
> (or probably rather firmware), getting in the way of at least two
> currently pending cleanups.

Hi again,

we are okay with removing the driver from the tree, let me know if
there is anything we should do to help.

Regards, Alex
Re: [PATCH v1 0/3] libibverbs: On-demand paging support
On 9/10/2015 7:53 PM, Doug Ledford wrote:
>> I don't see it in your kernel.org tree [1]
>
> Because I didn't push it yet. I can push it now
>
> Push complete.

So we should keep chasing every commit you took and have humans look for
it among a bunch of trees/branches? It's so tiring and annoying, with
all our (non-human) automation systems that attempt to pick applied
patches into the internal review, build and staging systems being
completely non-functioning for a couple of months.

Or.