drivers/staging/rdma/hfi1/sdma.c:740: bad if test ?

2015-09-11 Thread David Binderman
Hello there,

drivers/staging/rdma/hfi1/sdma.c:740:17: warning: logical ‘and’ of mutually 
exclusive tests is always false [-Wlogical-op]

Source code is

    if (count < 64 && count > 32768)
        return SDMA_DESCQ_CNT;

Maybe better code

    if (count < 64 || count > 32768)
        return SDMA_DESCQ_CNT;
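
The boolean logic is easy to check in isolation with a tiny stand-alone
program (illustration only, not hfi1 code; the SDMA_DESCQ_CNT value below is
a stand-in, not necessarily the driver's):

    #include <assert.h>

    #define SDMA_DESCQ_CNT 2048  /* stand-in default, not necessarily the driver's value */

    /* The '&&' form: no count is both below 64 and above 32768. */
    static unsigned int check_and(unsigned int count)
    {
            if (count < 64 && count > 32768)
                    return SDMA_DESCQ_CNT;
            return count;
    }

    /* The '||' form: out-of-range counts fall back to the default. */
    static unsigned int check_or(unsigned int count)
    {
            if (count < 64 || count > 32768)
                    return SDMA_DESCQ_CNT;
            return count;
    }

    int main(void)
    {
            assert(check_and(1) == 1);              /* bogus count slips through */
            assert(check_or(1) == SDMA_DESCQ_CNT);  /* bogus count is rejected */
            assert(check_or(1024) == 1024);         /* sane count is kept */
            return 0;
    }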


Regards

David Binderman



Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2015 at 05:40:41PM -0400, Doug Ledford wrote:
> Then a simple mcast_expire_task that runs every 10 minutes or so and
> leaves any send-only groups that haven't had a packet timestamp update
> in more than some arbitrary amount of time would be a very simple
> addition to make.

That all makes sense to me and addresses the backlog issue Christoph
mentioned.

It would also be fairly simple to do send-only join properly with the
above, as we simply switch to polling the SA to detect if the join is
good or not on the same timer that would expire it.

> > If it isn't subscribed to the broadcast MLID, it is violating MUST
> > statements in the RFC...
> 
> Yeah, my comment was in reference to whether or not it would receive a
> multicast packet sent on the broadcast group and forward it properly.

It would be hugely broken to be sensitive to the MLID when working
with multicast routing.

Jason


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-11 Thread Doug Ledford
On 09/11/2015 04:38 PM, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2015 at 04:09:49PM -0400, Doug Ledford wrote:
>> On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
>>> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>>>> During the recent rework of the mcast handling in ipoib, the join
>>>> tasks for regular and send-only joins were merged.  In the old code,
>>>> the comments indicated that the ipoib driver didn't send enough
>>>> information to auto-create IB multicast groups when the join was a
>>>> send-only join.  The reality is that the comments said we didn't, but
>>>> we actually did.  Since we merged the two join tasks, we now follow
>>>> the comments and don't auto-create IB multicast groups for an ipoib
>>>> send-only multicast join.  This has been reported to cause problems
>>>> in certain environments that rely on this behavior.  Specifically,
>>>> if you have an IB <-> Ethernet gateway then there is a fundamental
>>>> mismatch between the methodologies used on the two fabrics.  On
>>>> Ethernet, an app need not subscribe to a multicast group, merely
>>>> listen.
>>>
>>> This should probably be clarified. On all IP networks IGMP/MLD is used
>>> to advertise listeners.
>>>
>>> An IB/Eth gateway is a router, and IP routers are expected to process
>>> IGMP - so the gateway certainly can (and maybe must) be copying
>>> groups declared with IGMP from the eth side into listeners on IB MGIDs
>>
>> Obviously, the gateway in question currently is not doing this.
> 
> Sure, my remark was to clarify the commit comment so people don't think
> this is OK/expected behavior from a gateway.
> 
>> We could drop the queue backlog entirely and just send to broadcast
>> when the multicast group is unsubscribed.
> 
> I'm pretty sure that would upset the people who care about this
> stuff.. Steady state operation has to eventually move to the optimal
> MLID.

I didn't mean to imply that we wouldn't still attempt to join the group.
Just that right now the process is this:

send_mcast
  detect no mcast group
    initiate sendonly join, while queueing the packet to be sent
  process join completion
    success - send queued packet (which has been delayed)
    fail    - leave packet on queue, set backoff timer, try the join again
              after the timer expires; if we exceed the number of
              backoff/join attempts, drop the packet

The people I was on the phone with indicated that they are seeing some
packet loss of mcast packets on these sendonly groups.  For one thing,
that backlog queue we have while waiting for the join to complete is
capped at 3 deep and drops anything beyond that.  So it is easy to
imagine that even a small burst of packets could cause a drop.  But this
process also introduces delay on the packets being sent.  My comment
above then was to suggest the following change:

send_mcast
  no mcast group
    initiate sendonly join
    simultaneously send current packet to the broadcast group
  process join completion
    success - mark mcast group live, no queue backlog to send
    fail    - retry as above
  mcast group exists, but not yet live
    send current packet to the broadcast group
  mcast group exists and is live
    send current packet to the mcast group

The major change here is that we would never queue the packets any more.
If the group is not already successfully joined, then we simply send to
the broadcast group instead of queueing.  Once the group is live, we
start sending to the group.  This would eliminate packet delays assuming
that the gateway properly forwards multicast packets received on the
broadcast group versus on a specific IB multicast group.
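
A rough sketch of that modified send path, with made-up sketch_* helpers
standing in for the real ipoib functions (this only shows the control flow,
it is not actual driver code):

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Sketch only: the sketch_* helpers and struct are illustrative,
     * not the actual ipoib symbols. */
    struct sketch_mcast;
    struct sketch_mcast *sketch_mcast_lookup(struct net_device *dev, void *mgid);
    struct sketch_mcast *sketch_mcast_join_sendonly(struct net_device *dev, void *mgid);
    bool sketch_mcast_is_live(struct sketch_mcast *mcast);
    void sketch_send_to_broadcast(struct net_device *dev, struct sk_buff *skb);
    void sketch_send_to_group(struct sketch_mcast *mcast, struct sk_buff *skb);

    static void sketch_mcast_send(struct net_device *dev, void *mgid,
                                  struct sk_buff *skb)
    {
            struct sketch_mcast *mcast = sketch_mcast_lookup(dev, mgid);

            if (!mcast) {
                    /* No group yet: kick off the send-only join, but do not
                     * queue the packet while the join is outstanding. */
                    sketch_mcast_join_sendonly(dev, mgid);
                    sketch_send_to_broadcast(dev, skb);
                    return;
            }

            if (sketch_mcast_is_live(mcast))
                    /* Join completed: use the group's own MLID from now on. */
                    sketch_send_to_group(mcast, skb);
            else
                    /* Join still pending (or being retried): degrade to the
                     * broadcast group instead of queueing. */
                    sketch_send_to_broadcast(dev, skb);
    }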

Then a simple mcast_expire_task that runs every 10 minutes or so and
leaves any send-only groups that haven't had a packet timestamp update
in more than some arbitrary amount of time would be a very simple
addition to make.
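
And a sketch of the expire side using an ordinary delayed work item; only
the workqueue/jiffies/list APIs below are real, the sendonly_group structure
and the leave step are placeholders:

    #include <linux/workqueue.h>
    #include <linux/jiffies.h>
    #include <linux/list.h>

    #define SENDONLY_EXPIRE_INTERVAL (10 * 60 * HZ)  /* run every ~10 minutes */
    #define SENDONLY_IDLE_TIMEOUT    (30 * 60 * HZ)  /* arbitrary idle limit */

    /* Placeholder per-group state; the real ipoib structures differ. */
    struct sendonly_group {
            struct list_head list;
            unsigned long    last_send;  /* jiffies, updated on every tx */
    };

    static LIST_HEAD(sendonly_groups);      /* locking omitted for brevity */
    static struct delayed_work expire_work;

    static void sendonly_expire_task(struct work_struct *work)
    {
            struct sendonly_group *grp, *tmp;

            list_for_each_entry_safe(grp, tmp, &sendonly_groups, list) {
                    if (time_after(jiffies, grp->last_send + SENDONLY_IDLE_TIMEOUT)) {
                            list_del(&grp->list);
                            /* issue the SA leave / free the group here */
                    }
            }
            schedule_delayed_work(&expire_work, SENDONLY_EXPIRE_INTERVAL);
    }

    /* Somewhere in init:
     *   INIT_DELAYED_WORK(&expire_work, sendonly_expire_task);
     *   schedule_delayed_work(&expire_work, SENDONLY_EXPIRE_INTERVAL);
     */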

>> Well, we've already established that the gateway device might well be
>> broken.  That makes one wonder if this will work or if it might be
>> broken too.
> 
> If it isn't subscribed to the broadcast MLID, it is violating MUST
> statements in the RFC...

Yeah, my comment was in reference to whether or not it would receive a
multicast packet sent on the broadcast group and forward it properly.

>> and so this has been happening since forever in OFED (the above is from
>> 1.5.4.1).
> 
> But has this been dropped from the new 3.x series that tracks
> upstream exactly?

I haven't checked that (I don't have any of the later OFEDs downloaded,
and I didn't see a git repo for the 3.x series under Vlad's area on the
OpenFabrics git server).

>> our list we add.  Because the sendonly groups are not tracked at the net
>> core level, our only option is to move them all to the remove list and
>> when we get another sendonly packet, rejoin.  Unless we want them to
>> stay around forever.  But since they aren't real send-only joins, just
>> full joins where we simply ignore the incoming data, leaving them
>> around seems a bad idea.

Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2015 at 04:09:49PM -0400, Doug Ledford wrote:
> On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
> > On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
> >> During the recent rework of the mcast handling in ipoib, the join
> >> tasks for regular and send-only joins were merged.  In the old code,
> >> the comments indicated that the ipoib driver didn't send enough
> >> information to auto-create IB multicast groups when the join was a
> >> send-only join.  The reality is that the comments said we didn't, but
> >> we actually did.  Since we merged the two join tasks, we now follow
> >> the comments and don't auto-create IB multicast groups for an ipoib
> >> send-only multicast join.  This has been reported to cause problems
> >> in certain environments that rely on this behavior.  Specifically,
> >> if you have an IB <-> Ethernet gateway then there is a fundamental
> >> mismatch between the methodologies used on the two fabrics.  On
> >> Ethernet, an app need not subscribe to a multicast group, merely
> >> listen.
> > 
> > This should probably be clarified. On all IP networks IGMP/MLD is used
> > to advertise listeners.
> > 
> > An IB/Eth gateway is a router, and IP routers are expected to process
> > IGMP - so the gateway certainly can (and maybe must) be copying
> > groups declared with IGMP from the eth side into listeners on IB MGIDs
> 
> Obviously, the gateway in question currently is not doing this.

Sure, my remark was to clarify the commit comment so people don't think
this is OK/expected behavior from a gateway.

> We could drop the queue backlog entirely and just send to broadcast
> when the multicast group is unsubscribed.

I'm pretty sure that would upset the people who care about this
stuff.. Steady state operation has to eventually move to the optimal
MLID.

> Well, we've already established that the gateway device might well be
> broken.  That makes one wonder if this will work or if it might be
> broken too.

If it isn't subscribed to the broadcast MLID, it is violating MUST
statements in the RFC...

> and so this has been happening since forever in OFED (the above is from
> 1.5.4.1).

But has this been dropped from the new 3.x series that tracks
upstream exactly?

> our list we add.  Because the sendonly groups are not tracked at the net
> core level, our only option is to move them all to the remove list and
> when we get another sendonly packet, rejoin.  Unless we want them to
> stay around forever.  But since they aren't real send-only joins, just
> full joins where we simply ignore the incoming data, leaving them
> around seems a bad idea.

It doesn't make any sense to work like that. As is, the send-only
side looks pretty messed up to me.

It really needs to act like ND, and yah, that is a big change.

Just to be clear, I'm not entirely opposed to an OFED compatibility
module option, but let's understand how this is broken, what fix we
want to see for mainline, and why the OFED 'solution' is not
acceptable for mainline.

Jason


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-11 Thread Doug Ledford
On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>> During the recent rework of the mcast handling in ipoib, the join
>> tasks for regular and send-only joins were merged.  In the old code,
>> the comments indicated that the ipoib driver didn't send enough
>> information to auto-create IB multicast groups when the join was a
>> send-only join.  The reality is that the comments said we didn't, but
>> we actually did.  Since we merged the two join tasks, we now follow
>> the comments and don't auto-create IB multicast groups for an ipoib
>> send-only multicast join.  This has been reported to cause problems
>> in certain environments that rely on this behavior.  Specifically,
>> if you have an IB <-> Ethernet gateway then there is a fundamental
>> mismatch between the methodologies used on the two fabrics.  On
>> Ethernet, an app need not subscribe to a multicast group, merely
>> listen.
> 
> This should probably be clarified. On all IP networks IGMP/MLD is used
> to advertise listeners.
> 
> An IB/Eth gateway is a router, and IP routers are expected to process
> IGMP - so the gateway certainly can (and maybe must) be copying
> groups declared with IGMP from the eth side into listeners on IB MGIDs

Obviously, the gateway in question currently is not doing this.

> We may also be making a mistake in IPoIB - if the send side MGID join
> fails, the send should probably go to the broadcast, and the join
> retried from time to time. This would at least let the gateway
> optimize a bit more by only creating groups in active use.

That's not *too* far off from what we do.  This change could be made
without a huge amount of effort.  Right now we queue up the packet and
retry the join on a timer.  Only after all of the join attempts have
failed their timed retries do we dequeue and drop the packets.  We could
drop the queue backlog entirely and just send to broadcast when the
multicast group is unsubscribed.

> That would emulate the ethernet philosophy of degrade to broadcast.

Indeed.

> TBH, fixing broken gateway devices by sending to broadcast appeals to
> me more than making a module option..

Well, we've already established that the gateway device might well be
broken.  That makes one wonder if this will work or if it might be
broken too.

>> on the IB side of the gateway.  There are instances of installations
>> with 100's (maybe 1000's) of multicast groups, where static creation
>> of all the groups is not practical, that rely upon the send-only
> 
> With my last patch the SM now has enough information to auto-create
> the wonky send-only join attempts, if it wanted to. It just needs to fill
> in tclass, so it certainly is possible to address this long term
> without a kernel patch.
> 
>> joins creating the IB multicast group in order to function, so to
>> preserve these existing installations, add a module option to the
>> ipoib module to restore the previous behavior.
> 
> .. but an option to restore behavior of an older kernel version makes
> sense - did we really have a kernel that completely converted a
> send-only join to full join&create?

I just re-checked to verify this and here's what I found:

Kernels v3.10, v3.19, and v4.0 have this in ipoib_mcast_sendonly_join:

    if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
            ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
            return -EBUSY;
    }

    rec.mgid     = mcast->mcmember.mgid;
    rec.port_gid = priv->local_gid;
    rec.pkey     = cpu_to_be16(priv->pkey);

    comp_mask =
            IB_SA_MCMEMBER_REC_MGID |
            IB_SA_MCMEMBER_REC_PORT_GID |
            IB_SA_MCMEMBER_REC_PKEY |
            IB_SA_MCMEMBER_REC_JOIN_STATE;

    mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
                                     priv->port, &rec,
                                     comp_mask,
                                     GFP_ATOMIC,
                                     ipoib_mcast_sendonly_join_complete,
                                     mcast);

Kernels v4.1 and on no longer have sendonly_join as a separate function.
So, to answer your question, no, upstream didn't do this.  This,
however, is carried in OFED:

    rec.mgid     = mcast->mcmember.mgid;
    rec.port_gid = priv->local_gid;
    rec.pkey     = cpu_to_be16(priv->pkey);

    comp_mask =
            IB_SA_MCMEMBER_REC_MGID |
            IB_SA_MCMEMBER_REC_PORT_GID |
            IB_SA_MCMEMBER_REC_PKEY |
            IB_SA_MCMEMBER_REC_JOIN_STATE;

    if (priv->broadcast) {
            comp_mask |=
                    IB_SA_MCMEMBER_REC_QKEY |
                    IB_SA_MCMEMBER_REC_MTU_SELECTOR |
                    IB_SA_MCMEMBER_REC_MTU |
                    IB_SA_MCM

RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Hefty, Sean
> > Trying to limit the number of QPs that an app can allocate,
> > therefore, just limits how much of the address space an app can use.
> > There's no clear link between QP limits and HW resource limits,
> > unless you assume a very specific underlying implementation.
> 
> Isn't that the point though? We have several vendors with hardware
> that does impose hard limits on specific resources. There is no way to
> avoid that, and ultimately, those exact HW resources need to be
> limited.

My point is that limiting the number of QPs that an app can allocate doesn't 
necessarily mean anything.  Is allocating 1000 QPs with 1 entry each better or 
worse than 1 QP with 10,000 entries?  Who knows?

> If we want to talk about abstraction, then I'd suggest something very
> general and simple - two limits:
>  '% of the RDMA hardware resource pool' (per device or per ep?)
>  'bytes of kernel memory for RDMA structures' (all devices)

Yes - this makes more sense to me.



Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2015 at 07:22:56PM +, Hefty, Sean wrote:
 
> Trying to limit the number of QPs that an app can allocate,
> therefore, just limits how much of the address space an app can use.
> There's no clear link between QP limits and HW resource limits,
> unless you assume a very specific underlying implementation.

Isn't that the point though? We have several vendors with hardware
that does impose hard limits on specific resources. There is no way to
avoid that, and ultimately, those exact HW resources need to be
limited.

If we want to talk about abstraction, then I'd suggest something very
general and simple - two limits:
 '% of the RDMA hardware resource pool' (per device or per ep?)
 'bytes of kernel memory for RDMA structures' (all devices)

That comfortably covers all the various kinds of hardware we support
in a reasonable fashion.

Unless there really is a reason why we need to constrain exactly
and precisely PD/QP/MR/AH (I can't think of one off hand)

The 'RDMA hardware resource pool' is a vendor-driver-device specific
thing, with no generic definition beyond something that doesn't fit in
the other limit.
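
Purely to illustrate the shape of those two knobs (nothing like this exists
in the kernel today, every name below is invented), the core would only ever
check two numbers per cgroup and leave the meaning of a 'pool unit' entirely
to the vendor driver:

    /* Entirely hypothetical sketch; not an existing kernel interface. */
    struct rdmacg_sketch {
            unsigned int  hw_pool_used;  /* opaque, driver-defined units */
            unsigned int  hw_pool_max;   /* derived from the configured percentage */
            unsigned long kmem_used;     /* bytes of kernel memory for RDMA objects */
            unsigned long kmem_max;
    };

    /* Returns 0 on success, -1 if either limit would be exceeded. */
    static int rdmacg_sketch_try_charge(struct rdmacg_sketch *cg,
                                        unsigned int hw_units,
                                        unsigned long kmem_bytes)
    {
            if (cg->hw_pool_used + hw_units > cg->hw_pool_max ||
                cg->kmem_used + kmem_bytes > cg->kmem_max)
                    return -1;
            cg->hw_pool_used += hw_units;
            cg->kmem_used    += kmem_bytes;
            return 0;
    }

The driver would decide how many 'units' a QP, MR, or anything else costs on
its hardware; the core never needs to know what a QP is.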

Jason


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-11 Thread Doug Ledford
On 09/11/2015 02:21 PM, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2015 at 09:40:25AM -0500, Christoph Lameter wrote:
>> On Thu, 10 Sep 2015, Doug Ledford wrote:
>>
>>> +  * 1) ifdown/ifup
>>> +  * 2) a regular mcast join/leave happens and we run
>>> +  *    ipoib_mcast_restart_task
>>> +  * 3) a REREGISTER event comes in from the SM
>>> +  * 4) any other event that might cause a mcast flush
>>
>> Could we have a timeout and leave the multicast group on process
>> exit?
> 
> At a minimum, when the socket that did the send closes the send-only
> could be de-refed..

We could if we kept a ref count, but we don't.  Tracking this is not a small change.

> But really send-only in IB is very similar to neighbour discovery, and
> should work the same, with a timer, etc. In my mind it would be ideal
> to even add them to the ND table, if the core would support that..
> 
> Jason


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Tejun Heo
Hello, Parav.

On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
> > If you're planning on following what the existing memcg did in this
> > area, it's unlikely to go well.  Would you mind sharing what you have
> > on mind in the long term?  Where do you see this going?
>
> At least the current thoughts are: a central authority entity monitors the
> fail count and a new threshold count.
> Fail count - similar to other controllers, indicates how many times a
> resource failure occurred.
> Threshold count - indicates how high usage of this resource has gone (the
> application might not be able to poll on thousands of such resource entries).
> So based on the fail count and the threshold count, it can tune things further.

So, regardless of the specific resource in question, implementing
adaptive resource distribution requires more than simple thresholds
and failcnts.  The very minimum would be a way to exert reclaim
pressure and then a way to measure how much lack of a given resource
is affecting the workload.  Maybe it can adaptively lower the limits
and then watch how often allocation fails but that's highly unlikely
to be an effective measure as it can't do anything to hoarders and the
frequency of allocation failure doesn't necessarily correlate with the
amount of impact the workload is getting (it's not a measure of
usage).

This is what I'm wary about.  The kernel-userland interface here is
cut pretty low in the stack leaving most of arbitration and management
logic in the userland, which seems to be what people wanted and that's
fine, but then you're trying to implement an intelligent resource
control layer which straddles across kernel and userland with those
low level primitives which inevitably would increase the required
interface surface as nobody has enough information.

Just to illustrate the point, please think of the alsa interface.  We
expose hardware capabilities pretty much as-is leaving management and
multiplexing to userland and there's nothing wrong with it.  It fits
better that way; however, we don't then go try to implement cgroup
controller for PCM channels.  To do any high-level resource
management, you gotta do it where the said resource is actually
managed and arbitrated.

What's the allocation frequency you're expecting?  It might be better
to just let allocations themselves go through the agent that you're
planning.  You sure can use cgroup membership to identify who's asking
tho.  Given how the whole thing is architected, I'd suggest thinking
more about how the whole thing should turn out eventually.

Thanks.

-- 
tejun


RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Hefty, Sean
> So, the existence of resource limitations is fine.  That's what we
> deal with all the time.  The problem usually with this sort of
> interfaces which expose implementation details to users directly is
> that it severely limits engineering maneuvering space.  You usually
> want your users to express their intentions and a mechanism to
> arbitrate resources to satisfy those intentions (and in a way more
> graceful than "we can't, maybe try later?"); otherwise, implementing
> any sort of high level resource distribution scheme becomes painful
> and usually the only thing possible is preventing runaway disasters -
> you don't wanna pin unused resource permanently if there actually is
> contention around it, so usually all you can do with hard limits is
> overcommitting limits so that it at least prevents disasters.

I agree with Tejun that this proposal is at the wrong level of abstraction.

If you look at just trying to limit QPs, it's not clear what that attempts to 
accomplish.  Conceptually, a QP is little more than an addressable endpoint.  
It may or may not map to HW resources (for Intel NICs it does not).  Even when 
HW resources do back the QP, the hardware is limited by how many QPs can 
realistically be active at any one time, based on how much caching is available 
in the NIC.

Trying to limit the number of QPs that an app can allocate, therefore, just 
limits how much of the address space an app can use.  There's no clear link 
between QP limits and HW resource limits, unless you assume a very specific 
underlying implementation.

- Sean


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Tejun Heo
Hello, Parav.

On Fri, Sep 11, 2015 at 10:17:42PM +0530, Parav Pandit wrote:
> The IO controller and its applications are mature in nature.
> When the IO controller throttles IO, applications are mature enough that
> if the IO takes longer to complete, there is almost no way to cancel the
> system call, or the application might not even want to cancel the IO, at
> least the non-asynchronous kind.

I was more talking about the fact that they allow resources to be
consumed when they aren't contended.

> So the application just notices lower performance when throttled.
> It is really not possible at the RDMA level to hold up a resource
> creation call for a long time, because returning a failure and reusing
> an existing resource is likely to give better performance.
> As Doug explained in his example, many RDMA resources, as used by
> applications, are relatively long lived.  So holding up resource creation
> while the resource is held by another process will certainly look worse
> for application performance than returning a failure and reusing an
> existing one once it is available, or a new one once it becomes available.

I'm not really sold on the idea that this can be used to implement
performance based resource distribution.  I'll write more about that
on the other subthread.

Thanks.

-- 
tejun


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-11 Thread Jason Gunthorpe
On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
> During the recent rework of the mcast handling in ipoib, the join
> tasks for regular and send-only joins were merged.  In the old code,
> the comments indicated that the ipoib driver didn't send enough
> information to auto-create IB multicast groups when the join was a
> send-only join.  The reality is that the comments said we didn't, but
> we actually did.  Since we merged the two join tasks, we now follow
> the comments and don't auto-create IB multicast groups for an ipoib
> send-only multicast join.  This has been reported to cause problems
> in certain environments that rely on this behavior.  Specifically,
> if you have an IB <-> Ethernet gateway then there is a fundamental
> mismatch between the methodologies used on the two fabrics.  On
> Ethernet, an app need not subscribe to a multicast group, merely
> listen.

This should probably be clarified. On all IP networks IGMP/MLD is used
to advertise listeners.

An IB/Eth gateway is a router, and IP routers are expected to process
IGMP - so the gateway certainly can (and maybe must) be copying
groups declared with IGMP from the eth side into listeners on IB MGIDs

We may also be making a mistake in IPoIB - if the send side MGID join
fails, the send should probably go to the broadcast, and the join
retried from time to time. This would at least let the gateway
optimize a bit more by only creating groups in active use.

That would emulate the ethernet philosophy of degrade to broadcast.

TBH, fixing broken gateway devices by sending to broadcast appeals to
me more than making a module option..

> on the IB side of the gateway.  There are instances of installations
> with 100's (maybe 1000's) of multicast groups, where static creation
> of all the groups is not practical, that rely upon the send-only

With my last patch the SM now has enough information to auto-create
the wonky send-only join attempts, if it wanted to. It just needs to fill
in tclass, so it certainly is possible to address this long term
without a kernel patch.

> joins creating the IB multicast group in order to function, so to
> preserve these existing installations, add a module option to the
> ipoib module to restore the previous behavior.

.. but an option to restore behavior of an older kernel version makes
sense - did we really have a kernel that completely converted a
send-only join to full join&create?

> +  * An additional problem is that if we auto-create the IB
> +  * mcast group in response to a send-only action, then we
> +  * will be the creating entity, but we will not have any
> +  * mechanism by which we will track when we should leave
> +  * the group ourselves.  We will occasionally leave and
> +  * re-join the group when these events occur:

I would drop the language of creating-entity; that isn't something
from the IBA.

The uncontrolled lifetime of the join is not related to creating or
not.

> +  * 2) a regular mcast join/leave happens and we run
> +  *    ipoib_mcast_restart_task

Really? All send only mgids are discarded at that point? Ugly.

Jason


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2015 at 09:40:25AM -0500, Christoph Lameter wrote:
> On Thu, 10 Sep 2015, Doug Ledford wrote:
> 
> > +  * 1) ifdown/ifup
> > +  * 2) a regular mcast join/leave happens and we run
> > +  *    ipoib_mcast_restart_task
> > +  * 3) a REREGISTER event comes in from the SM
> > +  * 4) any other event that might cause a mcast flush
> 
> Could we have a timeout and leave the multicast group on process
> exit?

At a minimum, when the socket that did the send closes the send-only
could be de-refed..

But really send-only in IB is very similar to neighbour discovery, and
should work the same, with a timer, etc. In my mind it would be ideal
to even add them to the ND table, if the core would support that..

Jason


[PATCH] IB/ehca: Deprecate driver, move to staging, schedule deletion

2015-09-11 Thread Doug Ledford
The ehca driver is only supported on IBM machines with a custom EBus.
As they have opted to build their newer machines using more industry
standard technology and haven't really been pushing EBus capable
machines for a while, this driver can now safely be moved to the
staging area and scheduled for eventual removal.  This plan was brought
to IBM's attention and received their sign-off.

Cc: al...@linux.vnet.ibm.com
Cc: hngu...@de.ibm.com
Cc: rai...@de.ibm.com
Cc: stefan.rosc...@de.ibm.com
Signed-off-by: Doug Ledford 
---
 drivers/infiniband/Kconfig  | 1 -
 drivers/infiniband/hw/Makefile  | 1 -
 drivers/staging/rdma/Kconfig| 2 ++
 drivers/staging/rdma/Makefile   | 1 +
 drivers/{infiniband/hw => staging/rdma}/ehca/Kconfig| 3 ++-
 drivers/{infiniband/hw => staging/rdma}/ehca/Makefile   | 0
 drivers/staging/rdma/ehca/TODO  | 4 
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_av.c  | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_classes.h | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_classes_pSeries.h | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_cq.c  | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_eq.c  | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_hca.c | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_irq.c | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_irq.h | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_iverbs.h  | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_main.c| 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mcast.c   | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mrmw.c| 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mrmw.h| 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_pd.c  | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_qes.h | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_qp.c  | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_reqs.c| 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_sqp.c | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_tools.h   | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ehca_uverbs.c  | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hcp_if.c   | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hcp_if.h   | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hcp_phyp.c | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hcp_phyp.h | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hipz_fns.h | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hipz_fns_core.h| 0
 drivers/{infiniband/hw => staging/rdma}/ehca/hipz_hw.h  | 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ipz_pt_fn.c| 0
 drivers/{infiniband/hw => staging/rdma}/ehca/ipz_pt_fn.h| 0
 36 files changed, 9 insertions(+), 3 deletions(-)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/Kconfig (69%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/Makefile (100%)
 create mode 100644 drivers/staging/rdma/ehca/TODO
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_av.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_classes.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_classes_pSeries.h 
(100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_cq.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_eq.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_hca.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_irq.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_irq.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_iverbs.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_main.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mcast.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mrmw.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_mrmw.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_pd.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_qes.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_qp.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_reqs.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_sqp.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_tools.h (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/ehca_uverbs.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/hcp_if.c (100%)
 rename drivers/{infiniband/hw => staging/rdma}/ehca/hcp_if.h (100%)
 ren

Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Parav Pandit
> cpuset is a special case but think of cpu, memory or io controllers.
> Their resource distribution schemes are a lot more developed than
> what's proposed in this patchset and that's a necessity because nobody
> wants to cripple their machines for resource control.

The IO controller and its applications are mature in nature.
When the IO controller throttles IO, applications are mature enough that
if the IO takes longer to complete, there is almost no way to cancel the
system call, or the application might not even want to cancel the IO, at
least the non-asynchronous kind.
So the application just notices lower performance when throttled.
It is really not possible at the RDMA level to hold up a resource
creation call for a long time, because returning a failure and reusing
an existing resource is likely to give better performance.
As Doug explained in his example, many RDMA resources, as used by
applications, are relatively long lived.  So holding up resource creation
while the resource is held by another process will certainly look worse
for application performance than returning a failure and reusing an
existing one once it is available, or a new one once it becomes available.


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Parav Pandit
On Fri, Sep 11, 2015 at 10:04 PM, Tejun Heo  wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
>> A resource runaway by an application can leave (a) the kernel and (b) other
>> applications with no resources.
>
> Yeap, that's something this controller would be able to prevent to a
> reasonable extent.
>
>> Both of these problems are targets of this patch set, through accounting via cgroups.
>>
>> Performance contention can be resolved by higher-level user space,
>> which will tune it.
>
> If individual applications are gonna be allowed to do that, what's to
> prevent them from jacking up their limits?
I should have been more explicit. I didn't mean that the application
which is allocating the resources would be the one to control the limits.
> So, I assume you're
> thinking of a central authority overseeing distribution and enforcing
> the policy through cgroups?
>
Exactly.



>> Threshold and fail counters are on the way in a follow-on patch.
>
> If you're planning on following what the existing memcg did in this
> area, it's unlikely to go well.  Would you mind sharing what you have
> on mind in the long term?  Where do you see this going?
>
At least the current thoughts are: a central authority entity monitors the
fail count and a new threshold count.
Fail count - similar to other controllers, indicates how many times a
resource failure occurred.
Threshold count - indicates how high usage of this resource has gone (the
application might not be able to poll on thousands of such resource entries).
So based on the fail count and the threshold count, it can tune things further.
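
A hypothetical sketch of what those two counters amount to per resource type
(names invented, not taken from the posted patches):

    /* Hypothetical sketch; not from the posted patches. */
    struct rdmacg_resource_stat {
            unsigned long usage;      /* current number of objects charged */
            unsigned long max_usage;  /* high-water mark ("threshold count") */
            unsigned long fail_cnt;   /* allocations rejected by the limit */
            unsigned long limit;
    };

    static int rdmacg_stat_charge(struct rdmacg_resource_stat *s)
    {
            if (s->usage >= s->limit) {
                    s->fail_cnt++;          /* what the central agent watches */
                    return -1;
            }
            if (++s->usage > s->max_usage)
                    s->max_usage = s->usage;
            return 0;
    }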




> Thanks.
>
> --
> tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Tejun Heo
Hello, Parav.

On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
> A resource runaway by an application can leave (a) the kernel and (b) other
> applications with no resources.

Yeap, that's something this controller would be able to prevent to a
reasonable extent.

> Both of these problems are targets of this patch set, through accounting via cgroups.
> 
> Performance contention can be resolved by higher-level user space,
> which will tune it.

If individual applications are gonna be allowed to do that, what's to
prevent them from jacking up their limits?  So, I assume you're
thinking of a central authority overseeing distribution and enforcing
the policy through cgroups?

> Threshold and fail counters are on the way in a follow-on patch.

If you're planning on following what the existing memcg did in this
area, it's unlikely to go well.  Would you mind sharing what you have
on mind in the long term?  Where do you see this going?

Thanks.

-- 
tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Parav Pandit
> If the resource isn't and the main goal is preventing runaway
> hogs, it'll be able to do that but is that the goal here?  For this to
> be actually useful for performance contended cases, it'd need higher
> level abstractions.
>

A resource runaway by an application can leave (a) the kernel and (b) other
applications with no resources.
Both of these problems are targets of this patch set, through accounting via cgroups.

Performance contention can be resolved by higher-level user space,
which will tune it.
Threshold and fail counters are on the way in a follow-on patch.

> Thanks.
>
> --
> tejun


Re: ehca driver status?

2015-09-11 Thread Doug Ledford
On 09/11/2015 04:55 AM, Alexander Schmidt wrote:
> On Wed, 26 Aug 2015 06:46:48 -0700
> Christoph Hellwig  wrote:
> 
>> Hi Hoang-Nam, hi Christoph, hi Alexander,
>>
>> do you know what the current state of the eHCA driver and hardware is?
>>
>> The driver hasn't seen any targeted updates since 2010, so we wonder if
>> it's still alive?  It's one of the few drivers not supporting FRWRs,
>> and it's also really odd in that it has its own dma_map_ops
>> implementation that seems to pass virtual addresses to the hardware
>> (or probably rather firmware), getting in the way of at least two
>> currently pending cleanups.
>>
> 
> Hi again,
> 
> we are okay with removing the driver from the tree, let me know if
> there is anything we should do to help.

I'll get it moved over to staging ASAP so it can be on the same schedule
as the other two drivers we are deprecating.  Thanks!


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Tejun Heo
Hello, Parav.

On Fri, Sep 11, 2015 at 10:13:59AM +0530, Parav Pandit wrote:
> > My uneducated suspicion is that the abstraction is just not developed
> > enough.  It should be possible to virtualize these resources through,
> > most likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and OS schedules the transfer
> > prioritizing competing requests.
> 
> Tejun,
> That is such a perfect abstraction to have at the OS level, but I am not
> sure how close it can get to bare metal RDMA.
> I have started a discussion on that front as well as part of another
> thread, but it is certainly a long way to go.
> Most want to enjoy the performance benefit of the bare metal
> interfaces it provides.

Yeah, sure, I'm not trying to say that rdma needs or should do that.

> Such an abstraction as you mentioned does exist; the only difference is
> that instead of the OS as the central entity, the higher-level libraries,
> drivers and hw together do it today for the applications.

But more that having resource control in the OS and actual arbitration
higher up in the stack isn't likely to lead to an effective resource
distribution scheme.

> > You kinda have to decide that upfront cuz it gets baked into the
> > interface.
> 
> Well, all the interfaces are not yet defined. Except the test and

I meant the cgroup interface.

> benchmark utilities, real world applications wouldn't really bother
> much about which device they are going through.

Weights can work fine across multiple devices.  Hard limits don't.  It
just doesn't make any sense.  Unless you can exclude multiple device
scenarios, you'll have to implement per-device limits.

Thanks.

-- 
tejun


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Tejun Heo
Hello, Doug.

On Fri, Sep 11, 2015 at 12:24:33AM -0400, Doug Ledford wrote:
> > My uneducated suspicion is that the abstraction is just not developed
> > enough.
> 
> The abstraction is 10+ years old.  It has had plenty of time to ferment
> and something better for the specific use case has not emerged.

I think that is likely more reflective of the use cases rather than
anything inherent in the concept.

> >  It should be possible to virtualize these resources through,
> > most likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and OS schedules the transfer
> > prioritizing competing requests.
> 
> No.  And if you think this, then you miss the *entire* point of RDMA
> technologies.  An analogy that I have used many times in presentations
> is that, in the networking world, the kernel is both a postman and a
> copy machine.  It receives all incoming packets and must sort them to
> the right recipient (the postman job) and when the user space
> application is ready to use the information it must copy it into the
> user's VM space because it couldn't just put the user's data buffer on
> the RX buffer list since each buffer might belong to anyone (the copy
> machine).  In the RDMA world, you create a new queue pair, it is often a
> long lived connection (like a socket), but it belongs now to the app and
> the app can directly queue both send and receive buffers to the card and
> on incoming packets the card will be able to know that the packet
> belongs to a specific queue pair and will immediately go to that apps
> buffer.  You can *not* do this with TCP without moving to complete TCP
> offload on the card, registration of specific sockets on the card, and
> then allowing the application to pre-register receive buffers for a
> specific socket to the card so that incoming data on the wire can go
> straight to the right place.  If you ever get to the point of "OS
> schedules the transfer" then you might as well throw RDMA out the window
> because you have totally trashed the benefit it provides.

I don't know.  This sounds like classic "this is painful so it must be
good" bare metal fantasy.  I get that rdma succeeds at bypassing a lot
of overhead.  That's great but that really isn't exclusive with having
more accessible mechanisms built on top.  The crux of cost saving is
the hardware knowing where the incoming data belongs and putting it
there directly.  Everything else is there to facilitate that and if
you're declaring that it's impossible to build accessible abstractions
for that, I can't agree with you.

Note that this is not to say that rdma should do that in the operating
system.  As you said, people have been happy with the bare abstraction
for a long time and, given relatively specialized use cases, that can
be completely fine but please do note that the lack of proper
abstraction isn't an inherent feature.  It's just easier that way and
putting in more effort hasn't been necessary.

> > It could be that given the use cases rdma might not need such level of
> > abstraction - e.g. most users want to be and are pretty close to bare
> > metal, but, if that's true, it also kinda is weird to build
> > hierarchical resource distribution scheme on top of such bare
> > abstraction.
> 
> Not really.  If you are going to have a bare abstraction, this one isn't
> really a bad one.  You have devices.  On a device, you allocate
> protection domains (PDs).  If you don't care about cross connection
> issues, you ignore this and only use one.  If you do care, this acts
> like a process's unique VM space only for RDMA buffers, it is a domain
> to protect the data of one connection from another.  Then you have queue
> pairs (QPs) which are roughly the equivalent of a socket.  Each QP has
> at least one Completion Queue where you get the events that tell you
> things have completed (although they often use two, one for send
> completions and one for receive completions).  And then you use some
> number of memory registrations (MRs) and address handles (AHs) depending
> on your usage.  Since RDMA stands for Remote Direct Memory Access, as
> you can imagine, giving a remote machine free reign to access all of the
> physical memory in your machine is a security issue.  The MRs help to
> control what memory the remote host on a specific QP has access to.  The
> AHs control how we actually route packets from ourselves to the remote host.
> 
> Here's the deal.  You might be able to create an abstraction above this
> that hides *some* of this.  But it can't hide even nearly all of it
> without losing significant functionality.  The problem here is that you
> are thinking about RDMA connections like sockets.  They aren't.  Not
> even close.  They are "how do I allow a remote machine to directly read
> and write into my machine's physical memory in an even remotely close to
> secure manner?"  These resources aren't hardware resources, they are the
> abstraction resources neede
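
For readers who have not touched the verbs API, a minimal libibverbs sketch
of the object hierarchy described above (device, PD, CQ, MR, QP; address
handles and all error handling omitted) looks like this:

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
            struct ibv_device **devs = ibv_get_device_list(NULL);
            struct ibv_context *ctx = ibv_open_device(devs[0]);

            /* Protection domain: the per-connection "VM space" for RDMA buffers. */
            struct ibv_pd *pd = ibv_alloc_pd(ctx);

            /* Completion queue: where send/receive completions are reported. */
            struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

            /* Memory region: grants the HW (and thus the remote peer) access
             * to exactly this buffer and nothing else. */
            void *buf = malloc(4096);
            struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                           IBV_ACCESS_LOCAL_WRITE |
                                           IBV_ACCESS_REMOTE_READ |
                                           IBV_ACCESS_REMOTE_WRITE);

            /* Queue pair: the long-lived, app-owned endpoint (like a socket). */
            struct ibv_qp_init_attr attr = {
                    .send_cq = cq,
                    .recv_cq = cq,
                    .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                                 .max_send_sge = 1, .max_recv_sge = 1 },
                    .qp_type = IBV_QPT_RC,
            };
            struct ibv_qp *qp = ibv_create_qp(pd, &attr);

            /* These objects are exactly the resources the cgroup discussion
             * is about limiting. */
            ibv_destroy_qp(qp);
            ibv_dereg_mr(mr);
            free(buf);
            ibv_destroy_cq(cq);
            ibv_dealloc_pd(pd);
            ibv_close_device(ctx);
            ibv_free_device_list(devs);
            return 0;
    }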

Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-11 Thread Christoph Lameter
On Thu, 10 Sep 2015, Doug Ledford wrote:

> +  * 1) ifdown/ifup
> +  * 2) a regular mcast join/leave happens and we run
> +  *    ipoib_mcast_restart_task
> +  * 3) a REREGISTER event comes in from the SM
> +  * 4) any other event that might cause a mcast flush

Could we have a timeout and leave the multicast group on process exit?
The old code base did that with the ipoib_mcast_leave_task() function.

With that timeout we would no longer accumulate MC sendonly subscriptions
on long-running systems.

Also, IPOIB_MAX_MCAST_QUEUE's default of 3 is not really enough to capture
a burst of traffic sent to a multicast group.  Can we make this
configurable or increase the max?
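
If it were made configurable, a module parameter would be the obvious shape;
this is a sketch only, the knob name is made up and ipoib today hard-codes
IPOIB_MAX_MCAST_QUEUE:

    #include <linux/module.h>

    /* Hypothetical knob; ipoib currently hard-codes IPOIB_MAX_MCAST_QUEUE (3). */
    static unsigned int sendonly_queue_len = 3;
    module_param(sendonly_queue_len, uint, 0644);
    MODULE_PARM_DESC(sendonly_queue_len,
                     "Max packets queued per mcast group while a join is pending");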



Re: [PATCH v1 0/3] libibverbs: On-demand paging support

2015-09-11 Thread Doug Ledford
On 09/11/2015 03:14 AM, Or Gerlitz wrote:
> On 9/10/2015 7:53 PM, Doug Ledford wrote:
>>> >I don't see it in your kernel.org tree [1]
>> Because I didn't push it yet.  I can push it now
>>
>> Push complete.
>>
> 
> So we should keep chasing every commit you took and have humans look
> for it among a bunch of trees/branches,

No.  This is libibverbs, not my kernel tree.  There is only one place to
look, Or.  I hadn't pushed it out live yet even though I had accepted it
into my tree, that's all.  The rest of your complaints below have no
basis in reality when it comes to the libibverbs tree.

> it's so tiring and annoying
> with all our (non-human) automation systems, which attempt to pick applied
> patches into the internal review, build and staging systems, completely
> non-functioning for a couple of months.
> 
> Or.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: ehca driver status?

2015-09-11 Thread Alexander Schmidt
On Wed, 26 Aug 2015 06:46:48 -0700
Christoph Hellwig  wrote:

> Hi Hoang-Nam, hi Christoph, hi Alexander,
> 
> do you know what the current state of the eHCA driver and hardware is?
> 
> The driver hasn't seen any targeted updates since 2010, so we wonder if
> it's still alive?  It's one of the few drivers not supporting FRWRs,
> and it's also really odd in that it has its own dma_map_ops
> implementation that seems to pass virtual addresses to the hardware
> (or probably rather firmware), getting in the way of at least two
> currently pending cleanups.
> 

Hi again,

we are okay with removing the driver from the tree, let me know if
there is anything we should do to help.

Regards,
Alex



Re: [PATCH v1 0/3] libibverbs: On-demand paging support

2015-09-11 Thread Or Gerlitz

On 9/10/2015 7:53 PM, Doug Ledford wrote:

>> I don't see it in your kernel.org tree [1]
>
> Because I didn't push it yet.  I can push it now
>
> Push complete.



So we should keep chasing every commit you took and have humans look
for it among a bunch of trees/branches, it's so tiring and annoying
with all our (non-human) automation systems, which attempt to pick applied
patches into the internal review, build and staging systems, completely
non-functioning for a couple of months.


Or.