Re: [PATCH 10/11] IB: only keep a single key in struct ib_mr

2015-12-25 Thread Liran Liss
> From: Jason Gunthorpe 

> > >fill mr->key by the lkey or rkey based on that and everything will
> > >work fine.
> >
> > But the ULP *can* register a memory buffer with local and remote
> > access permissions.
> Not in the new API.
>
> If a ULP ever comes along that does need that then they can start with
> two MRs and then eventually upgrade the kapi to have some kind of
> efficient bidir MR mode.

ULPs are *already* using the same registrations for both local and
remote access.
So, currently, this is a no-go.
Sorry.

Let's make the required adjustments, which don't violate specs, and
then continue.

>
> What we've seen on the list lately is that every single ULP seems to
> have technical problems running the stack properly. We need to get off
> this idea that the spec has to govern the kapi - that didn't lead us
> any place nice.
>

Oh, please.
Our stack connects millions of endpoints in production using multiple
ULPs for the last 15 years with full interoperability because of
specs.

I am sure that we can keep improving our APIs while adhering to
industry standards, as we have done very successfully until now.

--Liran
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-17 Thread Liran Liss
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of Doug Ledford

> These patches add the concept of duplicate GIDs that are differentiated by
> their RoCE version (also called network type).  So, now, an incoming packet
> could match a couple different gid_indexes and we need additional
> information to get back to the definitive 1:1 mapping.  The submitted patches
> are designed around a lazy resolution of the namespace, preferring to defer
> the work of mapping the incoming packet to a unique namespace until that
> information is actually needed.  To enable this lazy resolution, it provides 
> the
> network_type so that the resolution can be done.
> 
> This is a fair assessment of the current state of things and what these
> patches do, yes?

Hi Doug,
Yes, your description nails the crux of the discussion.
Thanks.

> 
> Jason's objections are this:
> 
> 1)  The lazy resolution is wrong.
> 2)  The use of network_type as the additional information to get to the
> unique namespace is vendor specific cruft that shouldn't be part of the core
> kernel API.
> 

Which is totally wrong. RoCE versions and the network type attribute are part 
of the RoCE specification.
There is nothing vendor specific. In fact, this patch initially came from 
Avago!!!

> Jason's preference would be that the above issues be resolved by skipping
> the lazy resolution and instead doing proactive resolution on receipt of a
> packet and then probably just pass the namespace around instead of passing
> around the information needed to resolve the namespace.  Or, at a
> minimum, at least make the information added to the core API not
> something vendor specific like network_type, which is a detail of the
> Mellanox implementation.

Again, there is nothing here specific to the Mellanox implementation.
This argument is invalid.

> 
> Jason, is this accurate for your position?
> 
> If everyone agrees that this is a fair statement of where we stand, then I'll
> continue my response.  If not, please correct anything I have wrong above
> and I'll take that into my continued response.
> 

Since we are following the standard, and:
- This is *not* a vendor specific issue
- Lazy resolution adheres to the current flow of the stack (which IMO, is the 
right thing to do anyway)
- There are no security issues whatsoever (see below)
- The code could be easily extended to support namespaces in the future
- We should "release early release often", and not hold progress because of 
endless philosophical discussions
then let's apply these patches.

I perfectly understand Jason's proposal and its merits.
I just don't agree that it is a better approach, and in any case this is not 
the time to continue discussing it.

> 1 - Actually, for any received packet with associated IP address information.
> We've only enabled net namespaces for IP connections between user space
> applications, for direct IB connections or for kernel connections there is not
> yet any namespace support.

Correct.
That's why we are not risking anything in the KAPI with regard to namespaces; 
there is no security issue.

--Liran

> 
> --
> Doug Ledford 
>   GPG KeyID: 0E572FDD
> 

N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj"��!�i

RE: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Liran Liss
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of Doug Ledford

> In particular, Liran piped up with this comment:
> 
> "Also, I don't want to do any route resolution on the Rx path. A UD QP
> completion just reports the details of the packet it received.
> 
> Conceptually, an incoming packet may not even match an SGID index at all.
> Maybe, responses should be sent from another device. This should not be
> decided that the point that a packet was received."
> 
> The part that bothers me about this is that this statement makes sense when
> just thinking about the spec, as you say.  However, once you consider
> namespaces, security implications make this statement spec compliant, but
> still unacceptable.  The spec itself is silent on namespaces.  But, you guys
> wanted, and you got, namespace support.
> Since that's beyond spec, and carries security requirements, I think it's 
> fair to
> say that from now on, the Linux kernel RDMA stack can no longer *just* be
> spec compliant.  There are additional concerns that must always be
> addressed with new changes, and those are the namespace constraint
> preservation concerns.
> 

Hi Doug,

Currently, there is no namespace support for RoCE, so the RoCEv2 patches have 
*nothing* to do with this.
That said, the RoCE specification does not contradict or inhibit any future 
implementation for namespaces.
The CMA will get the  from ib_wc and resolve to a netdev (or 
sgid_index->netdev, whatever) and process the request accordingly.

We can have endless theoretical discussions on features that are not even 
implemented yet (e.g., RoCE namespace support) each time we add a minor 
straightforward, *spec-compliant* change that *all* RoCE vendors adhere to.
If someone wishes to introduce a new concept, API refactoring proposal, or 
similar for community review, please do so with a different RFC.

This is hindering progress of the whole RDMA stack development!
For example, the posted SoftRoCE patches are waiting just for this.

The RoCEv2 patches have been posted upstream for review for months (!) now.
I simply cannot understand why this is lagging for so long; let's start to get 
the wheels rolling.
--Liran



RE: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Liran Liss
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-

> > Since you and Jason did not reach a consensus, I have to dig in and
> > see if these patches make it possible to break namespace confinement,
> > either accidentally or with intentionally tricky behavior.  That's
> > going to take me some time.
> 
> Everything to do with parsing a wc and constructing an AH is wrong in this
> series, and the fixes require the API changes I mentioned ( add 'wc to gid
> index' API call, add 'route to AH' API call)
> 
> Every time you read 'route validation' - that is an error, the route should
> never just be validated, it is needed information to construct a rocev2 AH. 
> All
> the places that roughly hand parse the rocev2 WC should not be open coded.
> 
> Even if current HW is broken for namespaces we should not enshrine that in
> the kapi.
>

Currently, namespaces are not supported for RoCE.
So for this patches, this is irrelevant.
That said, we have everything we need for RoCE namespace support when we get 
there.

All of this has nothing to do with "broken" and enshrining anything in the kapi.
That's just bullshit.

The crux of the discussion is the meaning of the API.
The design of the RDMA stack is that Verbs are used by core IB services, such 
as addressing.
For these services, as the specification requires, all relevant fields must be 
reported in the CQE, period.
All spec-compliant HW devices follow this.

If a ULP wants to create an address handle from a completion, there are service 
routines to accomplish that, based on the reported fields.
If it doesn't care, there is no reason to sacrifice performance.

--Liran

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-13 Thread Liran Liss
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-

> 
> > You are pushing abstraction into provider code instead of handling it in a
> generic way.
> 
> No, I am defining an API that *make sense* and doesn't leak useless details.
> Of course that doesn't force code duplication or anyhting like that, just
> implement it smartly.
> 
> I think mlx made a big mistake returning network_type instead of gid index,
> and I don't want to see that error enshrined in our API.
> 
> > The Verbs are a low-level API, that should report exactly what was
> > received from the wire.  In the RoCEv2 case, it should be the GID/IP
> > addresses and the protocol type.  The addressing information is not
> > intended to be used directly by applications; it is the raw bits that
> > were accepted from the wire.
> 
> Low level details isn't what any in kernel consumer needs. Everything in
> kernel needs the gid index to determine the namespace, routing and other
> details. It is not optional. A common API is thus needed to do this 
> conversion.

The Verbs are not intended only for kernel consumers, but also for the ib_core, 
cma, etc.
For the ib_core, a provider needs to report *all* relevant information that is 
not visible in the packet payload.
The network type is part of this information.
The proposed changes are a straightforward extension to the code base, directly 
follow the specification, and adhere to the RDMA stack design in which IP 
addressing is handled by the cma.

Also, I don't want to do any route resolution on the Rx path.
A UD QP completion just reports the details of the packet it received.

Conceptually, an incoming packet may not even match an SGID index at all.
Maybe, responses should be sent from another device. This should not be decided 
that the point that a packet was received.

> 
> API-wise, once you get the gid index then it is trivial to make easy 
> extractors
> for everything else. ie for example:
> rdma_get_ud_src_sockaddr(gid_index,,wc,grh)
> rdma_get_ud_dst_sockaddr(gid_index,,wc,grh)
> 

Nice ideas, but not relevant to completions.
The resolved dst address could point to another port or device completely.
The proper way to handle remote UD addresses is by rdma_ids that encapsulate 
address handles and bound devices.

> > ib_init_ah_from_wc() and friends is exactly the place that you want to
> > create an address handle based on completion and packet fields.
> 
> CMA needs exactly the same logic as well, the fact it doesn't have it is a bug
> in this series.
> 

init_ah_from_wc() should include a route lookup, and this will be fixed.

> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-09 Thread Liran Liss
> From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]

> Look, here is a concrete direction:
> 
> Replace all the crap in
> ib_init_ah_from_wc/get_sgid_index_from_eth/rdma_addr_find_dmac_by_
> grh
> 
> with a straightforward
> 
>rdma_dgid_index_from_wc(
> const struct ib_qp *qp,
> const struct ib_wc *wc,
>   const struct ib_grh *grh,
>   u16 *gid_index)
> 
> Sort of function that reads the GRH and wc and returns the unambiguous gid
> index that was used to receive that packet on the UD QP.
> 
> The gid cache is not allowed to create an ambiguity the driver cannot resolve.
> 
> That said, I wouldn't object to vendor-specific bits in the wc. Ie if mlx
> hardware needs a network_type bit to implement
> rdma_find_dgid_index_from_wc, then fine - define a vendor specific place
> to put it. In this case rdma_find_dgid_index_from_wc would be a driver call
> back, which is fine, and what Caitlin was talking about.
> 
> But, it is not part of our verbs API, and I'd *strongly* encourage other
> vendors and future hardware to simply return the gid index that the
> hardware matched instead of requiring the software to try and guess after
> the fact.
> 

You are pushing abstraction into provider code instead of handling it in a 
generic way.

The Verbs are a low-level API, that should report exactly what was received 
from the wire.
In the RoCEv2 case, it should be the GID/IP addresses and the protocol type.
The addressing information is not intended to be used directly by applications; 
it is the raw bits that were accepted from the wire.
I don't want to replicate resolution code in every RoCE driver.

ib_init_ah_from_wc() and friends is exactly the place that you want to create 
an address handle based on completion and packet fields.
This is what applications will use (and only if they are interested in 
generating a response; otherwise this overhead can be skipped).

As I said earlier, we can make sure that protocol type information doesn't 
incur any additional overhead.
--Liran

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V1 5/9] IB/core: Add rdma_network_type to wc

2015-11-30 Thread Liran Liss
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-

> 
> The abstraction at the gid cache is making it too easy to make this mistake. 
> It
> is enabling callers to do direct gid lookups without a route lookup, which is
> unconditionally wrong. Every call site into the gid cache I looked at appears 
> to
> have this problem.
> 
> The simplest fix is to have a new gid cache api for rocve2 that somehow
> forces/includes the necessary route lookup. The existing API cannot simply
> be extended for rocev2.
> 

I think that the GID cache should remain just that: a cache.
We shouldn't bloat it.
The CMA is the proper place to handle IP resolution.

> > roce_gid_mgmt, is the part that populates this "dumb" database.
> > IMHO, adding such a "smart" layer to the GID cache is wrong, as this
> > should be part of RDMACM which does the translation. No need to get
> > the gid cache involved.
> 
> OK. Change the gid cache so only a RDMA CM private API can return
> rocev2 gids.
> 

The same cache is also used in IB and thus by other components, so it cannot
be a private CM API.
RoCE ULPs use the CMA for establishing connections, so route lookups should
be done from there.
 --Liran

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V7 6/6] IB/ucma: HW Device hot-removal support

2015-08-11 Thread Liran Liss
 From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]
 
 It does if you want a planned 'gental' removal to be possible..
 

There could be a lot of design options for a 'gentle' removal, such as first 
sending a 'prepare' event, and only then doing the flow proposed here.
I do not want to address this now.

  When you do a surprise removal, you disconnect the application from
  both ucma and uverbs device references.  In this state, the only
  thing that the application can do is close all handles.  It doesn't
  matter if you succeed closing a resource on a device that is still
  accessible, or close an already destroyed resource...
 
 Hurm. That makese sense, but that isn't entirely what this patch set
 does - it is rough when facing the user, but the kernel side is
 actually quite gental.. Ie the RDMA CM still sends GMPs on this
 shutdown path.. Looks like the various storage ULP shutdowns are
 pretty gental too.
 

Right.
Kernel removal is always gentle, because kernel ULPs are trusted to close 
within a finite time.

 As an example, I think it is important, if you
 want to hot-unplug a storage controller the admin expects a clean
 storage shutdown. I think in-kernel storage drivers do that
 already. IB shouldn't really be much different.
 

If a storage stack has user-space components, then our future 'gentle' closing 
will take care of that.

  Anyway, let's get this patch-set done.
 
 Still needs to have the locking fixed..

As pointed out 
(https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg27023.html), this 
barrier scheme is used in several places including in the RDMA stack.
If there is no functional issue with the code, we can discuss how to address 
this scheme systematically later.

--Liran
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V7 6/6] IB/ucma: HW Device hot-removal support

2015-08-06 Thread Liran Liss
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
 
 I have no real problem with that, it would be nice to have an answer
 to the uverbs vs ucma removal ordering question and the basic issue of
 if we even want to do this so async before merging the driver patch
 that enables it..
 

I don't think that the order matters.
When you do a surprise removal, you disconnect the application from both ucma 
and uverbs device references.
In this state, the only thing that the application can do is close all handles.
It doesn't matter if you succeed closing a resource on a device that is still 
accessible, or close an already destroyed resource...

The primary reason for device surprise removal is to recover from device fatal 
errors or PCI errors.
A quick asynchronous device recovery is exactly what we need, as there is 
nothing that the application can do more at this state.
This is the most pressing issue that this patch-set solves.

For the administrative removal use case, can consider providing applications a 
grace period before disconnecting them in a future patch set.
Even here, however, I am not sure that if an administrator decided to remove a 
device while there are still active applications, we should stall his request.
Anyway, let's get this patch-set done.

  The first 4 you already review and acked the fifth is in mlx4_ib and has no
  comments to fix so far - some months in the list.
 
 Personally, I'm not looking much at drivers..
 
 Jason

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v1 08/12] IB/cma: Add net_dev and private data checks to RDMA CM

2015-07-16 Thread Liran Liss
 From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]

  After all, it is the payload that designates the entity that you
  want to establish a connection to, rather than the packet headers,
  which are just meant to relay the packet to the proper CM
 
 No, that isn't right. The IBA uses the GMP's destination first, then
 serviceID as the demux. Services IDs are not globally unique, they are
 scoped by the destination.
 

The destination is the physical CA port and kernel CM agent, so I don't think 
the answer is that clear.
Going forward along these lines:
- Name space lookup is done based on BTH.pkey, private_data.IP, and optionally 
GRH.DGID (if present, for extra validation)
-- Primary and alternate paths are not considered at all

- If P_Key enforcement is set up via cgroups:
-- For CM processing, we only check BTH.pkey
--- Upon conflict, the packet is dropped
-- The primary/alternate path pkeys are not validated by CM, but will be 
validated during QP modification

Does this sound OK?

In any case, let's complete the namespace lookup first, and then follow up with 
a cgroup patchset.

 The path data is just *routing* it doesn't describe at all the entity
 we want to talk to, it is only a proposal for how to flow data to it.
 
 In any event, both the GMP headers and the path data needs to be
 checked against the container's pkey list. I don't know why this is so
 contentions.
 
 Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v1 08/12] IB/cma: Add net_dev and private data checks to RDMA CM

2015-07-15 Thread Liran Liss
 From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]
 
 
  What is really missing here I guess is a mechanism that would
  enforce containers to only use certain pkeys - perhaps with
  something like an RDMA cgroup.  It could force containers to only
  use approved pkeys not only with RDMA CM, but through uverbs, and
  other user-space interfaces.
 
 Ideally yes. Without that it just means you can't use pkey
 meaningfully with containers..
 

That's why partition enforcement should be separated from namespaces.

If you want to restrict a container to a specific set of pkeys, use cgroups.
This would apply both to CM MADs and QPs.
- In the MAD case, CM MADs would be first matched to a namespace and rdma_id 
and dropped upon pkey conflict (with either the headers or the payload).
- In the QP case, modify_qp() would fail on conflict.
Partitioning needs to be enforced also for applications that don't use the CM 
at all...

For namespaces, it seems more natural to lookup the namespace based solely on 
the CM payload.
After all, it is the payload that designates the entity that you want to 
establish a connection to, rather than the packet headers, which are just meant 
to relay the packet to the proper CM agent. (Spec-wise, it is even possible to 
use completely different node to open a connection on behalf of another node.)
So, the sole role of pkeys in the context of namespaces is to locate the proper 
namespace; not for partition isolation.

So, as a first step, let's get the namespace lookup in place.
Next, add an rdma crgoup, which would control pkeys, as well as resource counts 
(you don't want one container to rob the whole system from all QPs...).

 Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-18 Thread Liran Liss
 From: Weiny, Ira [mailto:ira.we...@intel.com]

  ib_verbs define an *extensive* direct HW access API, which is constantly
  evolving.
 
 This is the problem with verbs...

Huh?
It is its strength, if you don't break backward compatibility...

 
  You cannot describe the intricate object relations and semantics through an
  API.
  In addition, you can't abstract anything or fix stuff in SW.
  The only way to *truly* know what to expect when performing Verbs calls
 is to
  check the node type.
 
 How can you say this?
 
 mthca, mlx4, mlx5, and qib all have different sets of functionality... all 
 with
 the same node type.  OPA has the same set as qib...  same node type.
 

Only that qib is IB, which is fully interoperable with mlx*

 
  ib_verbs was never only an API. It started as the Linux implementation of
 the
  IBTA standard, with guaranteed semantics and wire protocol.
  Later, the interface was reused to support additional RDMA devices.
 However,
  you could *always* check the node type if you wanted to, thereby retaining
 the
  standard guarantees. Win-win situation...
 
 Not true at all.  For example, Qib does not support XRC and yet has the same
 node type as mlx4 (5)...

The node type is for guaranteeing semantics and interop for the features that 
you do implement...

 
 
  This is a very strong property; we should not give up on it.
 
 On the contrary the property is weak and implies functionality or lack of
 functionality rather than being explicit.  This was done because getting
 changes to kernel ABIs was hard and we took a shortcut with node type
 which we should not have.  OPA attempts to stop this madness and supports
 the functionality of verbs _As_ _Defined_ rather than creating yet another
 set of things which applications need to check against.
 

I totally agree that we should improve the expressiveness and accuracy of our 
capabilities;
you don't need OPA for this. Unfortunately, it is not always the case. 

Also, there are behaviors that are not defined by the API, but still rely on 
the node type.
Management applications , for example.

--Liran


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-14 Thread Liran Liss
 From: Doug Ledford [mailto:dledf...@redhat.com]

  But the node_type stands for more than just an abstract RDMA device:
  In IB, it designates an instance of an industry-standard, well-defined,
 device type: it's possible link types, transport, semantics, management,
 everything.
  It *should* be exposed to user-space so apps that know and care what
 they are running on could continue to work.
 
 I'm sorry, but your argument here is not very convincing at all.  And
 it's somewhat hypocritical.  When RoCE was first introduced, the *exact*
 same argument could be used to argue for why RoCE should require a new
 node_type.  Except then, because RoCE was your own, you argued for, and
 got, an expansion of the IB node_type definition that now included a
 relevant link_layer attribute that apps never needed to care about
 before.  However, now you are a victim of your own success.  You set the
 standard then that if the new device can properly emulate an IB Verbs/IB
 Link Layer device in terms of A) supported primitives (iWARP and usNIC
 both fail here, and hence why they have their own node_types) and B)
 queue pair creation process modulo link layer specific addressing
 attributes, then that device qualifies to use the IB_CA node_type and
 merely needs only a link_layer attribute to differentiate it.
 

No. RoCE is as an open standard from the IBTA with the exact same RDMA protocol 
semantics as InfiniBand and a clear set of compliancy rules without which an 
implementation can't claim to be such. A RoCE device *is* an IB CA with an 
Ethernet link.
In contrast, OPA is a proprietary protocol. We don't know what primitives are 
supported, and whether the semantics of supported primitives are the same as in 
InfiniBand.

 The new OPA stuff appears to be following *exactly* the same development
 model/path that RoCE did.  When RoCE was introduced, all the apps that
 really cared about low level addressing on the link layer had to be
 modified to encompass the new link type.  This is simply link_layer
 number three for apps to care about.
 

You are missing my point. API transparency is not a synonym for full semantic 
equivalence.  The Node Type doesn’t indicate level of adherence to an API. Node 
Type indicates compliancy to a  specification (e.g. wire protocol, remote order 
of execution, error semantics, architectural limitations, etc). The IBTA CA and 
Switch Node Types belong to devices that are compliant to the corresponding 
specifications from the InfiniBand Trade Association.  And that doesn’t prevent 
applications to choose to be coded to run over nodes of different Node Type as 
it happens today with IB/RoCE and iWARP.

This has nothing to do with addressing.



RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-11 Thread Liran Liss
 From: Doug Ledford [mailto:dledf...@redhat.com]

OPA cannot impersonate IB; OPA node and link types have to be
designated as such.  In terms of MAD processing flows, both
explicit (as in the handle_opa_smi() call below) and implicit code
paths (which share IB flows - there are several cases) must make
this distinction.
  
   As far as in the kernel is concerned, the individual capability bits
   are much more important.  I would actually like to do away with the
   node_type variable from struct ib_device eventually.  As for user
   space,

We agreed on the concept of capability bits for the sake of simplifying code 
sharing.
That is OK.

But the node_type stands for more than just an abstract RDMA device:
In IB, it designates an instance of an industry-standard, well-defined, device 
type: it's possible link types, transport, semantics, management, everything.
It *should* be exposed to user-space so apps that know and care what they are 
running on could continue to work.

The place for abstraction is in the rdmacm/CMA, which serves applications that 
just
want some RDMA functionality regardless of the underlying technology.

 
  All SMI code has different behavior if it is running on a switch or
  HCA, so testing for 'switchyness' is very appropriate here.
 
 Sure...
 
  cap_is_switch_smi would be a nice refinement to let us drop nodetype.
 
 Exactly, we need a bit added to the immutable data bits, and a new cap_
 helper, and then nodetype is ready to be retired.  Add a bit, drop a
 u8 ;-)
 

This is indeed a viable solution.

  I don't have a problem with sharing the IBA constant names for MAD
  structures (like RDMA_NODE_IB_SWITCH) between IB and OPA code. They
  already share the structure layouts/etc.
 

The node type is reflected to user-space, which, as I mentioned above, is 
important.
Abusing this enumeration is misleading, even in the kernel.
Jason's proposal for a 'cap_is_switch_smi' is more readable, and directly in 
line with
the explicit capability approach that we discussed.



RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-10 Thread Liran Liss
 From: Ira Weiny ira.we...@intel.com

Hi Ira,

OPA cannot impersonate IB; OPA node and link types have to be designated as 
such.
In terms of MAD processing flows, both explicit (as in the handle_opa_smi() 
call below) and implicit code paths (which share IB flows - there are several 
cases) must make this distinction.

 +static enum smi_action
 +handle_opa_smi(struct ib_mad_port_private *port_priv,
 +struct ib_mad_qp_info *qp_info,
 +struct ib_wc *wc,
 +int port_num,
 +struct ib_mad_private *recv,
 +struct ib_mad_private *response)
 +{
...
 + } else if (port_priv-device-node_type == RDMA_NODE_IB_SWITCH)  

--Liran
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-05-28 Thread Liran Liss
 
  Why do you have RDMA_NODE_IB_SWITCH related stuff inside the
 handle_opa_smi() function?
  Is there a node type of switch in OPA similar to IB?
 
 
 Yes.  OPA uses the same node types as IB.
 
 Ira
 

No, OPA cannot impersonate IB.
It has to have distinct node and link types.

--Liran

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support

2015-05-28 Thread Liran Liss
 From: Doug Ledford [mailto:dledf...@redhat.com]

  I suppose that the main issue would be handling existing user memory
 mappings,
  which cannot be just invalidated -- the user-space driver may not be aware
 of the
  device removal and may access this memory concurrently, and we don't
 want it
  to crash.
 
 In this case, you are mapping it out of the device BAR space and into a
 random kernel page, yes?  So, if the driver doesn't catch the
 DRIVER_FATAL event and process that to mean don't bother touching this
 RDMA device any more, it's going to write to a mailbox that no longer
 responds and have infinite timeouts, yes?  Essentially meaning all
 mailbox type operations just go into lala land from here on out, right?

The kernel activity is asynchronous to user-space.
The device may be un-plugged before the user-space driver has a chance to learn 
that a DEVICE_FATAL event has occurred. In fact, in the current user-space 
stack design, device drivers don't have a context of their own to read() from 
file descriptors and rely on the application for that.
But even so, you probably don't want a driver to invoke a system call during 
the fast path just to 



N�r��yb�X��ǧv�^�)޺{.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj��!�i

RE: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support

2015-05-28 Thread Liran Liss
 From: Doug Ledford [mailto:dledf...@redhat.com]
  I suppose that the main issue would be handling existing user memory
 mappings,
  which cannot be just invalidated -- the user-space driver may not be aware
 of the
  device removal and may access this memory concurrently, and we don't
 want it
  to crash.
 
 In this case, you are mapping it out of the device BAR space and into a
 random kernel page, yes?  So, if the driver doesn't catch the
 DRIVER_FATAL event and process that to mean don't bother touching this
 RDMA device any more, it's going to write to a mailbox that no longer
 responds and have infinite timeouts, yes?  Essentially meaning all
 mailbox type operations just go into lala land from here on out, right?
 

Pressed 'send' too early...

The kernel activity is asynchronous to user-space.
The device may be un-plugged before the user-space driver has a chance to learn 
that a DEVICE_FATAL event has occurred. In fact, in the current user-space 
stack design, device drivers don't have a context of their own to read() from 
file descriptors and rely on the application for that.
But even so, you probably don't want a driver to invoke a system call during 
the fast path just to check this condition.

For devices that just write the BAR space, an arbitrary kernel page would do.
Other devices might wish to first populate this page so that the user-space 
driver can detect this situation efficiently.

--Liran



RE: [PATCH v6 02/26] IB/Verbs: Implement raw management helpers

2015-04-24 Thread Liran Liss
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 
 Add raw helpers:
   rdma_tech_ib
   rdma_tech_iboe
   rdma_tech_iwarp
   rdma_ib_or_iboe (transition, clean up later) To help us detect which
 technology the port supported.
 

Replace rdma_tech_* with rdma_protocol_*.

 Cc: Hal Rosenstock h...@dev.mellanox.co.il
 Cc: Steve Wise sw...@opengridcomputing.com
 Cc: Tom Talpey t...@talpey.com
 Cc: Jason Gunthorpe jguntho...@obsidianresearch.com
 Cc: Doug Ledford dledf...@redhat.com
 Cc: Ira Weiny ira.we...@intel.com
 Cc: Sean Hefty sean.he...@intel.com
 Signed-off-by: Michael Wang yun.w...@profitbricks.com
 ---
  include/rdma/ib_verbs.h | 25 +
  1 file changed, 25 insertions(+)
 
 diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
 d54f91e..a12e876 100644
 --- a/include/rdma/ib_verbs.h
 +++ b/include/rdma/ib_verbs.h
 @@ -1748,6 +1748,31 @@ int ib_query_port(struct ib_device *device,  enum
 rdma_link_layer rdma_port_get_link_layer(struct ib_device *device,
  u8 port_num);
 
 +static inline int rdma_tech_ib(struct ib_device *device, u8 port_num) {
 + return device-query_transport(device, port_num)
 + == RDMA_TRANSPORT_IB;
 +}
 +
 +static inline int rdma_tech_iboe(struct ib_device *device, u8 port_num)
 +{
 + return device-query_transport(device, port_num)
 + == RDMA_TRANSPORT_IBOE;

Remove RDMA_TRANSPORT_IBOE.
In the current code, the test should be: (IB transport  Ethernet link layer).

We can later consider each provider declaring the transports directly.

 +}
 +
 +static inline int rdma_tech_iwarp(struct ib_device *device, u8
 +port_num) {
 + return device-query_transport(device, port_num)
 + == RDMA_TRANSPORT_IWARP;
 +}
 +
 +static inline int rdma_ib_or_iboe(struct ib_device *device, u8
 +port_num) {
 + enum rdma_transport_type tp = device-query_transport(device,
 +port_num);
 +
 + return (tp == RDMA_TRANSPORT_IB || tp ==
 RDMA_TRANSPORT_IBOE); }

Remove RDMA_TRANSPORT_IBOE.
Just test against RDMA_TRANSPORT_IB.

 +
  int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);
 
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v5 00/27] IB/Verbs: IB Management Helpers

2015-04-24 Thread Liran Liss
 From: Hefty, Sean [mailto:sean.he...@intel.com]

[snip]
   So, I think that our old-transport below is just fine.
   No need to change it (and you aren't, since it is currently
   implemented
  as a function).
 
  I think there is a need to change this.  Encoding the transport into
  the node type is not a good idea.  Having different transport
  semantics while still returning the same transport for the port is
  confusing.
 
  The only thing which is clear currently is Link Layer.
 
  But the use of Link Layer in the code is so convoluted that it is
  very confusing.
 
 I agree.
 
 One could implement software iWarp or IBoUDP (RoCEv2) protocols that
 could run over any link layer and interoperate with existing HW solutions.
 The stack shouldn't be dealing with the link level at all, with the exception 
 of
 user space compatibility.
 
  Define Transport?  There has been a lot of discussion over what a
  transport is in Verbs.
 
 IMO, we should replace using the word 'transport' with just 'rdma_protocol'.
 And even then I'm not convinced that anything should care, beyond user
 space compatibility.  The caps are what matter.
 
 - Sean

I completely agree.
If we ever see a need for representing a set or subset of cross-layer protocols 
(at any level, L2-L4, various encapsulations), we will add the proper 
management helpers.
For example:
- rdma_protocol_roce() /* both v1 and v2 */
- rdma_protocol_roce_v1()
- rdma_protocol_roce_v2()
- rdma_protocol_usnic()
- rdma_protocol_usnic_udp()



RE: [PATCH v5 00/27] IB/Verbs: IB Management Helpers

2015-04-24 Thread Liran Liss
 From: Michael Wang [mailto:yun.w...@profitbricks.com]


 [snip]
 
  Depends on who is we.
  For ULPs, you are probably right.
 
  However, core services (e.g., mad management, CM, SA) do care about
 various details.
  In some cases, where it doesn't matter, this code will use management
 helpers.
  In other cases, this code will inspect link, transport, and node attributes 
  of
 rdma devices.
 
  For example, the CM code has specific code paths for IB, RoCE, and iWARP.
  There is no other CM code; there is no reason to abstract 'CM'. This
  code will have code paths that depend on various specific details.
 
 That's exactly what we want to stop, we have classified the CM to IB and
 IWARP now :-)


We don't want to stop code branches that are not abstractions but rather depend
on the specific technology!
There is no generic iWARP CM - only one.
There is no generic ROCE CM - only one.
There is no generic IB CM - only one.

At the CM high-level (i.e., whether an ib_dev port registers an IB client), you 
could consider
an rdma_has_cm() call, but this the only place in the code that this check will 
be called!
Hence, no need for a generic check.

You want to stop abstract code that uses IB core infrastructure.

 
  This new transport is only understand by core-layer currently, for
  user-layer we still reserve the old transport for them, next step is
  to use bitmask instead of transport, at that time we can erase the
  new transport and make the whole stuff used by user-layer only :-)
 
 
  I am not sure that we need a bit mask at all.
  Your helpers already provide all the useful abstractions, which both core
 and ULPs call directly.
  All the information is inferred directly from link, transport, node 
  tuples.
 
  Some of the user-space tools need *exactly* the same reasoning.
  For example, management tools manage specific technologies and
 protocols, not some abstraction.
 
  So, For user-space, we can think about exposing exactly the same
  helper framework, while providing backward compatibility for the existing
 interfaces.
 
 I'd really like to put the topic on bitmask and user app reform into different
 thread...
 
 bitmask should be next topic, there are many discussion already, but I could
 imaging far more discussion there, the user reform should be the last step,
 after every thing in kernel settled down :-)
 

OK

 
 
 
  Detailed remarks
  ==
 
  1) The introduction of cap_*_*() stuff should have been introduced
  directly
  in patch 02/27.
  This back-and-forth between rdma_ib_or_iboe() and cap_* is confusing
  and
  increases the number of patches in the patch-set.
  Do this and remove patches 16-24.
 
  We have some discussion about compress the patch set, merge the
  reform and introducing patch will mix the concept (like the earlier
  version), IMHO it will increase the difficulty of review...
 
  And now since many review already been done, it's not wise to change
  the whole structure of patch set IMHO...
 
 
  I think it is because you are conditioning code on one thing, and then
  conditioning the same code on another thing.
 
  This is confusing.
 
  Once we get our abstractions correct (i.e., the right helper
  functions), you replace the existing logic with the suitable helper 
  up-front.
 
 We need to classify and integrate the concept into mgmt helper, that would
 be very helpful for further reform, reform followed by integration sounds not
 that bad, correct?
 

The problem is that it is hard to follow the reasoning for the first use 
consumer
code with the in-complete helper frame-work.

 
 
  2)The name rdma_tech_* is lame.
  rdma_transport_*(), adhering to the above (*) remark, is much better.
  For example, both IB and ROCE *do* use the same transport.
 
  We have some discussion on that too, use transport means going back...
 
 
  No.
  The existing notion of transport was correct. It was the node type that
 wasn't.
  And in any case the new helpers didn't use it.
 
  We need the original meaning of transport - see my response to Ira.
  I propose replacing rdma_node_get_transport() with the following helpers:
  - rdma_get_transport()
  - rdma_is_ib_transport()
  - rdma_is_iwarp_transport()
 
 We can change the name at anytime, tech/transport/protocol/standard, just
 one patch later can easily change it and start the topic of naming, any of
 these name will unsatisfied someone AFAIK, I'd like to suggest we consider
 this as a mark temporarily and focus on the logical issue.

Sure.

The logical issue is:

1. We need the existing notion of transport, meaning a bunch of L4+headers + 
semantics presented to apps.
2. We might need an *additional* notion of rdma_protocol, which designates a 
complete wire-format: L2-L4+ including.
This could be later a bitmask, a management helper, whatever.
Currently, I don't see anything in the existing code that would call such 
helpers.

 
  - ...
 
 
  3) The name cap_* as it is used above is not accurate.
  You use it to 

RE: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()

2015-04-24 Thread Liran Liss
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 
[snip]
 a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
 65994a1..d54f91e 100644
 --- a/include/rdma/ib_verbs.h
 +++ b/include/rdma/ib_verbs.h
 @@ -75,10 +75,13 @@ enum rdma_node_type {  };
 
  enum rdma_transport_type {
 + /* legacy for users */
   RDMA_TRANSPORT_IB,
   RDMA_TRANSPORT_IWARP,
   RDMA_TRANSPORT_USNIC,
 - RDMA_TRANSPORT_USNIC_UDP
 + RDMA_TRANSPORT_USNIC_UDP,
 + /* new transport */
 + RDMA_TRANSPORT_IBOE,

Remove RDMA_TRANSPORT_IBOE - it is not a transport.
ROCE uses IBTA transport.

If any code should test for ROCE should invoke a specific helper, e.g., 
rdma_protocol_iboe().
This is  what you currently call rdma_tech_iboe is patch 02/26.

I think that pretty much everybody agrees that rdma_protocol_*() is a better 
name than rdma_tech_*(), right?
So, let's change this.

The semantics are: check that a link supports a certain wire protocol, or a 
set of wire protocols, where 'certain'
refers to the specific helper...


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v5 00/27] IB/Verbs: IB Management Helpers

2015-04-22 Thread Liran Liss
 From: Michael Wang [mailto:yun.w...@profitbricks.com]
 
 Hi, Liran
 
 Thanks for the comment :-)
 
 On 04/22/2015 01:36 AM, Liran Liss wrote:
 [snip]
 
  (**) This has been extended to also encode the transport in the current
 code.
  At least for user-space visible APIs, we might chose to leave this for
 backward compatibility, but we can consider cleaning up the kernel code.
 
  So, I think that our old-transport below is just fine.
  No need to change it (and you aren't, since it is currently implemented as a
 function).
 
  The new-transport does not really exist, but is broken into several
 capability checks of the L4 transport, optionally with conditions on the link
 type.
  I would remove the table below and tell what we really want to achieve:
  == move technology-specific feature-check logic out of the (multiple!) IB
 code components and various ULPs into per-feature helpers.
 
 Our purpose is to help core layer do management more clearly, rather then
 referring from transport and linklayer.
 

Right.

 IMHO from management's point of view, what we really care about is whether
 a particular management required by device or not, rather then the details on
 transport and link layer.
 

Depends on who is we.
For ULPs, you are probably right.

However, core services (e.g., mad management, CM, SA) do care about various 
details.
In some cases, where it doesn't matter, this code will use management helpers.
In other cases, this code will inspect link, transport, and node attributes of 
rdma devices.

For example, the CM code has specific code paths for IB, RoCE, and iWARP.
There is no other CM code; there is no reason to abstract 'CM'. This code will 
have code
paths that depend on various specific details.

 This new transport is only understand by core-layer currently, for user-layer
 we still reserve the old transport for them, next step is to use bitmask 
 instead
 of transport, at that time we can erase the new transport and make the
 whole stuff used by user-layer only :-)
 

I am not sure that we need a bit mask at all.
Your helpers already provide all the useful abstractions, which both core and 
ULPs call directly.
All the information is inferred directly from link, transport, node tuples.

Some of the user-space tools need *exactly* the same reasoning.
For example, management tools manage specific technologies and protocols, not 
some abstraction.

So, For user-space, we can think about exposing exactly the same helper 
framework, while providing
backward compatibility for the existing interfaces.

 
 
  Detailed remarks
  ==
 
  1) The introduction of cap_*_*() stuff should have been introduced directly
 in patch 02/27.
  This back-and-forth between rdma_ib_or_iboe() and cap_* is confusing and
 increases the number of patches in the patch-set.
  Do this and remove patches 16-24.
 
 We have some discussion about compress the patch set, merge the reform
 and introducing patch will mix the concept (like the earlier version), IMHO it
 will increase the difficulty of review...
 
 And now since many review already been done, it's not wise to change the
 whole structure of patch set IMHO...
 

I think it is because you are conditioning code on one thing, and then 
conditioning
the same code on another thing.

This is confusing.

Once we get our abstractions correct (i.e., the right helper functions), you 
replace the
existing logic with the suitable helper up-front.

 
  2)The name rdma_tech_* is lame.
  rdma_transport_*(), adhering to the above (*) remark, is much better.
  For example, both IB and ROCE *do* use the same transport.
 
 We have some discussion on that too, use transport means going back...
 

No.
The existing notion of transport was correct. It was the node type that wasn't.
And in any case the new helpers didn't use it.

We need the original meaning of transport - see my response to Ira.
I propose replacing rdma_node_get_transport() with the following helpers:
- rdma_get_transport()
- rdma_is_ib_transport()
- rdma_is_iwarp_transport()
- ...

 
  3) The name cap_* as it is used above is not accurate.
  You use it to describe technology characteristics rather than extendable
 capabilities.
  I would suggest having a single convention for all helpers, such as
 rdma_has_*() and rdma_is_*().
  For example: cap_ib_smi() == rdma_has_smi().
 
 That means going back too...

See response to Ira (https://lkml.org/lkml/2015/4/21/951).


 
 
  4) Remove all capabilities that do not introduce any distinction in the
 current code.
  We can add them as needed later.
  This means remove patches:
  - [PATCH v5 22/27] IB/Verbs: Use management helper cap_ipoib() – all
  IB devices support ipoib
  - [PATCH v5 24/27] IB/Verbs: Use management helper cap_af_ib() – all IB
 devices support AF_IB.
 
  On the other hand:
  - rdma_has_multicast() makes sense, since iWARP doesn’t support it.
  - cap_ib_sa() might make sense to cut code even further in the CMA, since
 RoCE has a GSI but no SA.
 
 We

RE: [PATCH v5 00/27] IB/Verbs: IB Management Helpers

2015-04-21 Thread Liran Liss
Hi Michael,

The spirit of this patch-set is great, but I think that we need to clarify some 
concepts.
Since this will affect the whole patch-set, I am laying out my concerns here 
instead.

A suggestion for the resulting management helpers is given below.
I believe the result would be much more coherent.
--Liran

In general


An ib_dev (or a port of) should be distinguished by 3 qualifiers:
- The link layer:
-- Ethernet (shared by iWARP, USNIC, and ROCE)
-- Infiniband

- The transport (*)
-- IBTA transport (shared by IB and ROCE)
-- iWARP transport
-- USNIC transport

(*) Transport means both:
- The L4 wire protocols (e.g., BTH+ headers of IBTA, optionally encapsulated by 
UDP in ROCEv2, or the iWARP stack)
- The transport semantics (for example, there are slight semantic differences 
between IBTA and iWARP)

- The node type (**)
-- CA
-- Switch
-- Router

(**) This has been extended to also encode the transport in the current code.
At least for user-space visible APIs, we might chose to leave this for backward 
compatibility, but we can consider cleaning up the kernel code.

So, I think that our old-transport below is just fine.
No need to change it (and you aren't, since it is currently implemented as a 
function).

The new-transport does not really exist, but is broken into several 
capability checks of the L4 transport, optionally with conditions on the link 
type.
I would remove the table below and tell what we really want to achieve:
== move technology-specific feature-check logic out of the (multiple!) IB code 
components and various ULPs into per-feature helpers.


Detailed remarks
==

1) The introduction of cap_*_*() stuff should have been introduced directly in 
patch 02/27.
This back-and-forth between rdma_ib_or_iboe() and cap_* is confusing and 
increases the number of patches in the patch-set.
Do this and remove patches 16-24.

2)The name rdma_tech_* is lame.
rdma_transport_*(), adhering to the above (*) remark, is much better.
For example, both IB and ROCE *do* use the same transport. 

3) The name cap_* as it is used above is not accurate.
You use it to describe technology characteristics rather than extendable 
capabilities.
I would suggest having a single convention for all helpers, such as 
rdma_has_*() and rdma_is_*().
For example: cap_ib_smi() == rdma_has_smi().

4) Remove all capabilities that do not introduce any distinction in the current 
code.
We can add them as needed later.
This means remove patches:
- [PATCH v5 22/27] IB/Verbs: Use management helper cap_ipoib() – all IB devices 
support ipoib
- [PATCH v5 24/27] IB/Verbs: Use management helper cap_af_ib() – all IB devices 
support AF_IB.

On the other hand:
- rdma_has_multicast() makes sense, since iWARP doesn’t support it.
- cap_ib_sa() might make sense to cut code even further in the CMA, since RoCE 
has a GSI but no SA.

5) Do no modify phys_state_show() in [PATCH v5 09/27] IB/Verbs: Reform IB-core 
verbs/uverbs_cmd/sysfs
It *is* the link layer!

6) Remove cap_read_multi_sge
It is not device/port feature, but a transport capability.
Use rdma_is_iwarp_transport() instead, or introduce a new transport flag in 
'enum ib_device_cap_flags'.
 
7) Remove [PATCH v5 25/27] IB/Verbs: Use management helper cap_eth_ah().
Address handles that refer to Ethernet links always have Ethernet addressing.

In the CMA code, using rdma_tech_iboe() is just fine. This is how you define 
cap_eth_ah() anyway.
Currently, this patch just adds clutter.

8) Remove patch [PATCH v5 26/27] IB/Verbs: Clean up rdma_ib_or_iboe().
We do need a transport qualifier, as exemplified in comment 5) above, and for a 
complete clean model.
This is after renaming the function to rdma_is_ib_transport()...


Putting it all together
==

We are left with the following helpers:
- rdma_is_ib_transport()
- rdma_is_iwarp_transport()
- rdma_is_usnic_transport()
- rdma_is_iboe()
- rdma_has_mad()
- rdma_has_smi()
- rdma_has_gsi() - complements smi; can be used by the mad code for clarity
- rdma_has_sa()
- rdma_has_cm()
- rdma_has_mcast()


 Subject: [PATCH v5 00/27] IB/Verbs: IB Management Helpers
 
 
 Since v4:
   * Thanks for the comments from Hal, Sean, Tom, Or Gerlitz, Jason,
 Roland, Ira and Steve :-) Please remind me if anything missed :-P
   * Fix logical issue inside 3#, 14#
   * Refine 3#, 4#, 5# with label 'free'
   * Rework 10# to stop using port 1 when port already assigned
 
 There are plenty of lengthy code to check the transport type of IB device, or
 the link layer type of it's port, but actually we are just speculating 
 whether a
 particular management/feature is supported by the device/port.
 
 Thus instead of inferring, we should have our own mechanism for IB
 management capability/protocol/feature checking, several proposals below.
 
 This patch set will reform the method of getting transport type, we will now
 using query_transport() instead of inferring from transport and link layer
 respectively, also we defined the 

RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-16 Thread Liran Liss
   RoCE v2 is really Infiniband over UDP over IP.  Why don't we just
   call
  it IBoUDP like it is?
  RoCEv2 is the name in the IBTA spec (Annex 17)
 
 We call RoCE IBoE in the kernel, because that's what it is.  RoCE is an IBTA
 marketing name.
 
 Looking through the Annex, I don't see where Ethernet is even a requirement
 for this technology to work.  The IB transport is layered over a standard UDP
 header.  I do see where the spec calls out updating the IP header, but that's 
 it.
 
 Regardless of what it's called, it replaces the underlying network and
 transport protocols, versus IB-classic or IBoE/RoCE.  That should be captured
 properly, not by saying there's a new GID type.  RoCE v2 doesn't even use
 GIDs as part of its protocol.  It uses UDP/IP addresses.
 
 

The RoCE Verbs interface references the HCA GID table in QPs and AHs, for all 
RoCE versions.
The specification mandates that the Verbs consumer may use the following 
protocols over the same RoCE-capable HCA and the same physical port:
- RoCEv1 (L2, IB GRH, IB BTH+, payload)
- RoCEv2 using IPv4 (L2, IPv4, UDP, IB BTH+, payload)
- RoCEv2 using IPv6 (L2, IPv6, UDP, IB BTH+, payload)
The distinction (by spec) is done by associating a GID type attribute to each 
GID table entry, which is either IB, IPv4, or IPv6.
This is how apps can create different RC QPs using these different wire 
protocols, or a single UD QP that can send packets to all of these wire 
protocols.

Perhaps we could add another enum entry for RoCEv1 in the patch to make this 
clearer.

   IBoUDP changes the Ethertype, replaces the network header, adds a
   new
  transport protocol header, and layers IB over that.  This change
  should be exposed properly and not as just a new GID type.
  I don't understand what do you suggest here. Can you give an example?
 
 I don't have a solution here.  Please look at Michael Wang's patch series and
 see how this would fit into that model.  The introduction of iWarp required
 defining a new 'transport' type.  IBoE added a new link layer.  Based on those
 changes, this would warrant introducing a new network layer, so that it can
 be distinguished properly from the other options.  Maybe that's the right
 approach?
 

The new-transport in Michael's patches doesn't refer to the network transport 
layer, but rather acts as a summary of the tuple link, transport, node_type 
(the network layer is indeed skipped).
The network transport layer of Infiniband and *all* RoCE types is the same: IB 
transport.
As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot be a 
port attribute because RoCE HCAs support all of them over the same physical 
port.

Maybe we should change these patches to encode the port summary as a bit 
mask, and provide convenient masks for queries?
Alternatively, we could leave the port qualifiers as they are (i.e., with 
distinct link, transport, node_type qualifiers) instead of introducing 
new-transport, but provide convenient wrappers for ULPs to use.
For example:
- is_roce() /* returns true for all RoCE wire protocols */
- is_rocev1()
- is_rocev2()
- is_iwarp()
- ...

 Cisco's NIC reports a transport layer of 'USNIC_UDP', which should really just
 be 'UDP'.  That NIC supports UDP/IP/Ethernet, based on the rdma stack's
 model.
 RoCE v2 is also UDP/IP/Ethernet, it only layers IB over that.  (This makes the
 use of the term 'transport' confusing.  Maybe there should also be a 'session'
 protocol?)  It seems completely reasonable that a device which does
 IB/UDP/IP/Ethernet to someday expose the UDP/IP/Ethernet portion (if it
 doesn't already), and from the same port at the same time.

RoCEv2 uses UDP only as an encapsulation protocol (just like VXLAN).
The transport is well-defined: IBTA transport (just like Infiniband).
In the USNIC case, UPD is indeed the actual network transport.

 
 Rather than continuing to try to make everything look like an IB-classic 
 device
 because it's convenient, the stack needs to start exposing things properly.  I
 don't know what the right solution should be, but trying to capture this level
 of detail as a different GID type definitely looks like a step in the wrong
 direction.

As mentioned above, this is how Verbs consumers send traffic using the 
different wire-protocols over the same physical port.
This patch-set follows the specification, and cleanly provides all per-GID 
associations in 'struct ib_gid_attr'.

So basically, in RoCE, a GID entry represents a network interface (a netdev) to 
the HCA, and encompasses all information related with that interface: MAC, 
VLAN, MTU, IP address, namespace, etc.
This also makes it straight forward to add future extensions.

 
 - Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-16 Thread Liran Liss
 
  The RoCE Verbs interface references the HCA GID table in QPs and AHs,
  for all RoCE versions.
 
 The IBTA specifically does not define software interfaces.  The concern here 
 is
 the architecture and definition of the linux rdma software stack, not verbs,
 despite the fact that the layer is named that.


Whatever.
The ib_* API is the Linux implementation for IBTA Verbs (both Infiniband and 
RoCE flavors) and iWARP Verbs.
This interface uses a GID table.
In this patch-set, there is no conceptual change in this regard.
It adds GID types, as stated by the RoCE specification, which is a natural 
extension to the existing code.
I am sure that we can consider additional changes in the future, but one 
patch-set at a time...
 
  As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot
  be a port attribute because RoCE HCAs support all of them over the
  same physical port.
 
 IMO, the architecture needs to change to support multiple network and
 transport layers over the same link layer.  The current restriction of 1
 transport per port is too restrictive, and doesn't seem to be true even for
 current devices, like usnic.
 

usnic uses UDP transport.
Infiniband and RoCE use the IB transport (all the BTH+ header stuff as defined 
by IBTA).
iWARP uses IETF defined protocols over TCP, namely RDMAP+DDP+MPA.

I am not aware of any device that mixes these transports today.
If we ever see such a need in the future, let's talk again. Currently, this is 
irrelevant.

  Maybe we should change these patches to encode the port summary as a
  bit mask, and provide convenient masks for queries?
 
 Switching to bit masks is the longer term objective.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Status of ummunot branch?

2013-06-10 Thread Liran Liss
Here are a few more clarifications:

1) ODP MRs can cover address ranges that do not have a mapping at registration 
time.

This means that MPI can register in advance, say, the lower GB's of the address 
space, covering malloc's primary arena.
Thus, there is no need to adjust to each increase in sbrk().

Similarly, you can register the stack region up to the maximum size of the 
stack.
The stack can grow and shrink, and ODP will always use the current mapping.

2) Virtual addresses covered by an ODP MR must have a valid mapping when they 
are is accessed (during send/receive WQE processing or as a target of an 
RDMA/atomic operation).
So, Jeff, the only thing you need to make sure is that you don't free() a 
buffer that you posted and haven't got a completion yet - but I guess that this 
is something that you already do... :)

For example, in the following scenario:
a. reg_mr(first GB of the address space)

b. p = malloc()
c. post_send(p)
d. poll for completion
e. free(p)

f. p = malloc()
g. post_send(p)
h. poll for completion
i. free(p)

(c) may incur a page fault (if not pre-fetched or faulted-in by another thread).
(e) happens after the completion, so it is guaranteed that (c), when processed 
by HW, uses the correct application buffer with the current virt-to-phys 
mapping (at HW access time)

The reallocation may or may not change the virtual-to-physical mappings.
The message may or may not be paged out (ODP does not hold a reference on the 
page).
In any case, when (g) is processed, it always uses the current mapping.

--Liran



-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Saturday, June 08, 2013 2:58 AM
To: Jeff Squyres (jsquyres)
Cc: Haggai Eran; Or Gerlitz; linux-rdma@vger.kernel.org; Shachar Raindel
Subject: Re: Status of ummunot branch?

On Fri, Jun 07, 2013 at 10:59:43PM +, Jeff Squyres (jsquyres) wrote:

  I don't think this covers other memory regions, like those added via mmap, 
  right?
  
 We talked about this at the MPI Forum this week; it doesn't seem like 
 ODP fixes any MPI problems.

ODP without 'register all address space' changes the nature of the problem, and 
fixes only one problem.

You do need to cache registrations, and all the tuning parameters (how much do 
I cache, how long do I hold it for, etc, etc) all still apply.

What goes away (is fixed) is the need for intercepts and the need to purge 
address space from the cache because the backing registration has become 
non-coherent/invalid. Registrations are always coherent/valid with ODP.

This cache, and the associated optimization problem, can never go away. With a 
'register all of memory' semantic the cache can move into the kernel, but the 
performance implication and overheads are all still present, just migrated.

 2. MPI still has to intercept (at least) munmap().

Curious to know what for? 

If you want to prune registrations (ie to reduce memory footprint), this can be 
done lazyily at any time (eg in a background thread or something). Read 
/proc/self/maps and purge all the registrations pointing to unmapped memory. 
Similar to garbage collection.

There is no harm in keeping a registration for a long period, except for the 
memory footprint in the kernel.

 3. Having mmap/malloc/etc. return new memory that may already be 
 registered because of a prior memory registration and subsequent 
 munmap/free/etc. is just plain weird.  Worse, if we re-register it, 
 ref counts could go such that the actual registration will never 
 actually expire until the process dies (which could lead to processes 
 with abnormally large memory footprints, because they never actually 
 let go of memory because it's still registered).

This is entirely on the registration cache implementation to sort out, there 
are lots of performance/memory trade offs.

It is only weird when you think about it in terms of buffers. memory 
registration has to do with address space, not buffers.

 What MPI wants is:
 
 1. verbs for ummunotify-like functionality 2. non-blocking memory 
 registration verbs; poll the cq to know when it has completed

To me, ODP with an additional 'register all address space' semantic, plus an 
asynchronous prefetch does both of these for you.

1. ummunotify functionality and caching is now in the kernel, under
   ODP. RDMA access to an 'all of memory' registration always does the
   right thing.
2. asynchronous prefetch (eg as a work request) triggers ODP and
   kernel actions to ready a subset of memory for RDMA, including
   all the work that memory registration does today (get_user_pages,
   COW break, etc)
   
Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the 
body of a message to majord...@vger.kernel.org More majordomo info at  
http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to 

RE: Status of ummunot branch?

2013-06-10 Thread Liran Liss
 -Original Message-
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 ow...@vger.kernel.org] On Behalf Of Jeff Squyres (jsquyres)
 Sent: Monday, June 10, 2013 5:50 PM
 To: Jason Gunthorpe
 Cc: Haggai Eran; Or Gerlitz; linux-rdma@vger.kernel.org; Shachar Raindel
 Subject: Re: Status of ummunot branch?
 
 On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe
 jguntho...@obsidianresearch.com wrote:
 
  We talked about this at the MPI Forum this week; it doesn't seem like
  ODP fixes any MPI problems.
 
  ODP without 'register all address space' changes the nature of the
  problem, and fixes only one problem.
 
 I agree that pushing all registration issues out of the application and
 (somewhere) into the verbs stack would be a nice solution.
 
  You do need to cache registrations, and all the tuning parameters (how
  much do I cache, how long do I hold it for, etc, etc) all still apply.
 
  What goes away (is fixed) is the need for intercepts and the need to
  purge address space from the cache because the backing registration
  has become non-coherent/invalid. Registrations are always
  coherent/valid with ODP.
 
  This cache, and the associated optimization problem, can never go
  away. With a 'register all of memory' semantic the cache can move into
  the kernel, but the performance implication and overheads are all
  still present, just migrated.
 
 Good summary; and you corrected some of my mistakes -- thanks.
 
 That being said, everyone I've talked to about ODP finds it very, very strange
 that the kernel would keep memory registrations around for memory that is
 no longer part of a process.  Not only does it lead to the new memory is
 magically already registered semantic that I find weird, it's just plain 
 *odd*
 for the kernel to maintain state for something that doesn't exist any more.  
 It
 feels dirty.
 
 Sidenote: I was just informed today that the current way MPI
 implementations implement registration cache coherence (glibc malloc
 hooks) has been deprecated and will be removed from glibc
 (http://sourceware.org/ml/libc-alpha/2011-05/msg00103.html).  This really
 puts on the pressure to find a new / proper solution.
 
  What MPI wants is:
 
  1. verbs for ummunotify-like functionality 2. non-blocking memory
  registration verbs; poll the cq to know when it has completed
 
  To me, ODP with an additional 'register all address space' semantic,
  plus an asynchronous prefetch does both of these for you.
 
  1. ummunotify functionality and caching is now in the kernel, under
ODP. RDMA access to an 'all of memory' registration always does the
right thing.
 
 Register all address space is the moral equivalent of not having userspace
 registration, so let's talk about it in those terms.  Specifically, there's a 
 subtle
 difference between:
 
 a) telling verbs to register (0...2^64)
-- Which is weird because it tells verbs to register memory that isn't in 
 my
 address space


Another way to look at it is specify IO access permissions for address space 
ranges.
This could be useful to implement a buffer pool to be used for a specific MR 
only, yet still map/unmap memory within this pool on the fly to optimize 
physical memory utilization.
In this case, you would provide smaller ranges than 2^64...


 b) telling verbs that the app doesn't want to handle registration
-- How that gets implemented is not important (from userspace's point of
 view) -- if the kernel chooses to implement that by registering non-existent
 memory, that's the kernel's problem
 
 I guess I'm arguing that registering non-existent memory is not the Right
 Thing.
 
 Regardless of what solution is devised for registered memory management
 (ummunotify, ODP, or something else), a non-blocking verb for registering
 memory would still be a Very Useful Thing.
 
 --
 Jeff Squyres
 jsquy...@cisco.com
 For corporate legal information go to:
 http://www.cisco.com/web/about/doing_business/legal/cri/
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in the
 body of a message to majord...@vger.kernel.org More majordomo info at
 http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH net-next V0 18/21] mlx4_core: adjust catas operation for SRIOV mode

2011-12-04 Thread Liran Liss
 On Fri, Dec 2, 2011 at 2:19 AM, Yevgeny Petrilin
 yevge...@mellanox.co.il wrote:
  When running in SRIOV mode, driver should not automatically
 start/stop
  the mlx4_core upon sensing an HCA internal error -- doing this
 disables/enables
  sriov, which will cause the hypervisor to hang if there are running
 VMs with
  attached VFs.
 
 Not sure I understand this -- what happens if the driver doesn't reset
 the device
 after a catastrophic error?  Surely all the VFs are pretty screwed at
 that point?
 
 Which hypervisor are we talking about here anyway?
 
  - R.

Indeed, if you don't reset the device after an internal error, the PF/VF might 
not work properly unless you reload the PF driver.

However, invoking the SW reset by the PF (upon detecting an internal error) 
currently affects the SRIOV capability on the PCI so we cannot do it before 
disabling SRIOV first.
Since the driver is not in charge of passing-through VFs to VMs, it cannot 
disable SRIOV either so we cannot reset the device while using SRIOV.
Note that the single-function behavior is not modified.

We intend to post a fix for this in a different patch-set.
--Liran

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-07-20 Thread Liran Liss
   Small correction needed regarding the multicast forwarding.
   Since we are talking about IPv6 multicast groups, which 
 translate to   33:33:xx:xx:xx:xx MAC address, the router 
 listener notification protocol   is going to be MLD and not 
 IGMP. Still there are switches which support   MLD 
 forwarding to prevent the network flooding.
 
 Well as I said the mapping of IBoE MGID to Ethernet address 
 is not specified.  However I agree that using the same 
 mapping as IPv6 so we end up with 33:33:... addresses makes sense.

Agreed.

 
 Yes, you are right that MLD snooping is the mechanism for 
 switches to discover IPv6 multicast group membership.  
 However for the IBoE case there is no requirement that IPv6 
 multicast group membership corresponds in any way to the IBoE 
 multicast group membership for the interface (and indeed as 
 far as I can tell from the IBoE spec, there is no requirement 
 that any IPv6 interface be configured on an IBoE port).
 
 Furthermore, even if an IBoE interface sends MLD messages for a given
 IPv6 group, there is no requirement that a switch use the 
 membership information for that group to forward multicast 
 packets with a non-IPv6 ethertype.

Right. Initially there can be flooding within the VLAN. In the future we can
evolve to use a group-membership protocol when customers that care about
the efficiency drive their switch vendors to support it.

Are there any other issues that you would like us to address before updating 
the patches?
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-07-15 Thread Liran Liss
   The text is saying that the specification does not use any of the 
   LID fields in the verbs interface, that is it. It isn't talking 
   about MAC addresses.
   
   Exactly how and where the MAC address comes about was 
 never decided, 
   and at least some participants thought it should be a 1:1 
   algorithmic mapping from the GID.
   
   Ditto for VLANs, how and where the vlan tag comes about 
 is not part 
   of the spec.
 
  You are trying to rewrite history.
  Read the spec, address handles fields are fixed.
 
 Not really, this was all discussed on this list before the 
 IBxoE working group was formed,

The paragraph above is about the RoCE spec. And *this list* did not write the 
RoCE spec.

 it was discussed in the 
 working group,

The RoCE spec adopts the verbs defined in the base IB spec and does not add any 
new input modifiers to the AH verb. You may not agree with it but that does not 
change the spec.

 I objected to the draft spec leaving this area 
 absent, even.

You should submit a comment on this matter using the IBTA comment tracker 
database if you intend your concern to be taken into account.

 The spec doesn't say squat about how MAC and 
 VLAN values get into the AH,

True. The spec does not say it because there are no MAC and VLAN input 
modifiers to the create AH verb. The spec assumes the resolution from the L3 
address happens below the channel interface.

 and you have already heard how 
 my opinion on this subject differs from others.

I never attempted to misrepresent your opinion. I am just pointing out what the 
RoCE spec says.

 
   But, even if we do get there some day then we could extend the AH.
  
  This is unacceptable - we are not going to add another L3 
 identifier.
 
 It wouldn't be adding another L3 itentifier it would be an L2 
 next hop MAC address for the router. It would be nice to do 
 this from the start but if growing the AH is really that 
 scary then it should wait until someone figures out how to 
 solve the lossless routing problem on ethernet.

Augmenting the AH has a significant cost. There is a tradeoff here between 
preserving the verbs api vs. dealing with the implementation challenges 
associated with doing address resolution below the verbs. The RoCE spec 
deliberately chooses one direction. You seem to favor the other one. But in the 
interest of progress and since we all seem to agree on the way things work when 
we use link local GIDs, let us move forward with that approach for now. And we 
can get back to non local GIDs later.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-07-15 Thread Liran Liss
   But, we can't mandate an overload of the GID in a way that 
 it   prevents its use as a true L3 address (eventually routable).
 
 Actually I'm beginning to think that the only possible way we 
 can use the GID in IBoE is as a link-local IPv6 addresses 
 containing an Ethernet address.  Trying to hide neighbour 
 discovery or ARP below the verbs doesn't seem workable -- 
 being forced to change the locking rules we've had for the 
 past 5+ years about create_ah is just the beginning.  We get 
 further problems if a remote address should ever change and 
 I'm probably missing other issues.
 

We believe the problems are workable. But let us stop arguing for a while and
make progress with link local addressees since we all seem to agree with that.
We can get back to non-local GIDs later

 So the best solution I can see is to declare that an IBoE GID 
 must be an
 IPv6 address coming from an EUI-64 Ethernet address for the 
 corresponding port; for MGIDs I guess we use the standard 
 IPv6 mapping to Ethernet address 33:33:xx:xx:xx:xx.
 
 I'm not sure how we want to handle IPv4 -- presumably unicast 
 ARP can be done within the RDMA CM, which will then create a 
 DGID with the appropriate Ethernet address.  However it's not 
 clear to me whether we need a way to create IPv4 
 (01:00:5e:xx:xx:xx) multicast addresses.
 
 Also, since there is no way to map a link-local IPv6 address 
 to a particular interface, then I guess we need a way to pass 
 in the VLAN tag to be used -- presumably we can steal some 
 other field for the 12 bits.
 (The fact that the IBoE annex does not mention VLANs or 
 802.1q a single time is just another thing that shows how 
 rushed and incomplete it is)
 
 With all this said, I think it means we do not need to do the 
 mapping from GID to Ethernet address in the kernel for IBoE 
 user verbs, since it is so simple -- we can simply add a 
 fairly trivial helper to libibverbs.
 
  - R.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-07-15 Thread Liran Liss
   A quibble about multicast - AFAIK this is unsolved. I 
 think some spec   needs to be agreed that documents what 
 sort of multicast snooping   operations switches need to do, 
 ie if IGMP joins imply that IBoE   traffic for the same DMAC 
 is included in the join, or if IBoE requires   a seperate 
 IGMP type process on its own ether-type. That would make it  
  much clearer what to do with MGIDs.


It would be quite naïve to require *new* snooping functionality in Eth 
switches. Some switches will gracefully apply to non-ip traffic the filtering 
information acquired through IGMP snooping. And some will just flood non-ip MC 
frames within the corresponding VLAN which is benign (e.g. that is the way FIP 
works). A cleaner solution would be based on MMRP but that, AFAIK, is not very 
widely deployed so it is less practical at this stage. 
 
 I agree -- the current spec is rather broken for multicast.  
 Choosing a different ethertype and then saying that all 
 switches will just flood multicast traffic is half-baked at best.


It is a realistic approach. Do you claim that there are switches that will not 
forward the packets?
 
   It would be nice to at least have a plan on how to 
 integrate a   non-link local address, if that is ever 
 necessary in future. An   extended AH with an additional 48 
 DMAC field seems reasonable to me?
 
 You mean have a next-hop destination + a final destination?  
 Could be done I guess.  But I'm not sure how having a routing 
 table where you have to look up 48-bit Ethernet addresses is 
 all that different from just having a standard Ethernet 
 forwarding table.

I guess Jason suggests regarding the GID as a true L3 address and using a new 
added L2 field for the next hop L2 address.

 
 I suppose something based on MAC-in-MAC (a la 802.1ah) could 
 be done but to be honest the IBoE spec that the IBTA came up 
 with looks rather broken for routing.

Routing is out of the scope of the current RoCE spec.
And I do not see how .1ah would be relevant for this purpose.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-07-13 Thread Liran Liss
  ...A verbs consumer using a RoCE network relies strictly
   on so-called
Layer 3 addressing (GIDs); layer 2 addresses (e.g. subnet local
identifiers) are not passed across the verbs interface...
   
   Ah, hmm, well, I was on that list during this time and I don't
  think
   this statement means what you are saying it does  :)
  
 
  ?? It doesn't get any clearer than this.
   
 'subnet local identidifer' == LID
 
 The text is saying that the specification does not use any of the LID
 fields in the verbs interface, that is it. It isn't talking about MAC
 addresses.
 
 Exactly how and where the MAC address comes about was never decided,
 and at least some participants thought it should be a 1:1 algorithmic
 mapping from the GID.
 
 Ditto for VLANs, how and where the vlan tag comes about is not part of
 the spec.
 

You are trying to rewrite history.
Read the spec, address handles fields are fixed.

  Good idea! This is exactly what we do today for addresses that the
  user explicitly declares as link-local addresses.  But, we can't
  mandate an overload of the GID in a way that it prevents its use as
  a true L3 address (eventually routable).
 
 We are very unlikely to see routable IBoE, ever..

Says who?

 But, even if we do get there some day then we could extend the AH.

This is unacceptable - we are not going to add another L3 identifier.
 
 BTW, I absolutely hate the mixing of 'Sometimes it is a IPv4,
 sometimes it is a GID, and sometimes it is an IPv6' in the same
 field. That is just so nasty. The GID is a GID, don't overload it in
 an ambiguous way to mean 2 other things!

A GID is a GID indeed -- in a RoCE environment, it's the layer 3 identifier.
All of our intended values are standard ipv6 encapsulations.

   create_ah does not accept any sort of source address 
   specifier
  
  You are wrong -- sgid_index specifies it.
 
 So, what do you propose to put in sgid_index? It isn't big enough to
 store an IPv6 address. You can't exactly number every IP assigned to
 every ethernet interface.

An iboe device is associated with a specific Ethernet interface. Thus, its gid 
table
only needs to map the ip addresses assigned to that interface.

 The other fields you mention are not a supserset of socket parameters,
 they are only IPv6 parameters, IPv4 uses a different set.

Like what?

  Jason, bottom line, I think that we both agree that the rdmacm
  should do the address resolution.  The difference is that by having
  the rdmacm initially only bind to the device and complete the
  resolution later (by a call from create_ah()), we don't change the
  user API for *all* gid types.  Having addressed your concerns
  regarding resolution below the Verbs, we continue to believe that
  this is the best approach.
 
 Again, I don't see how what I've outlined changes the API in any
 way.

We currently support link-local gids, but the architecture must not limit the 
scope.

 
 Doing two routing lookups for the same connection is bad design, it is
 racey. L2 parameters have to flow from the first routing lookup in
 RDMA-CM to everything else.
 

So is caching L3--L2 mappings that change a second later...
So what?

 Liran, I don't think you have at all come close to addressing my
 concerns, you still haven't explained how a full route lookup is even
 possible in create_ah, for instance. Let alone my other concerns!
 
 Jason
 --
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-07-10 Thread Liran Liss
This discussion has derived into whether we need to expose Eth L2 params across 
the Verbs interface.
This point has been discussed extensively in the IBTA during the development of 
the RoCE spec. And the direction chosen by the spec is clear:

...A verbs consumer using a RoCE network relies strictly on so-called Layer
3 addressing (GIDs); layer 2 addresses (e.g. subnet local identifiers) are not 
passed across the verbs interface...

The motivation behind this direction had to do with preserving API transparency 
for applications.
The goal was to allow existing and future applications to run above RoCE and 
native IB without changes. 

As opposed to what it may seem at first sight, adding Eth L2 parameters to the 
address vector, *does not* make RoCE closer to IB.
It actually goes the other way around.
Here is a quick list of what would have to be changed if we were to include Eth 
L2 address parameters to the Address Vector and other structures/functions that 
expose L2 params:

Structure changes:
- ibv_wc
- ibv_ah_attr
-- ibv_qp_attr
-- rdma_ud_param
--- rdma_cm_event
- ibv_port_attr

Verb API changes:
- ibv_poll_cq()
- ibv_init_ah_from_wc()
- ibv_create_ah()

- ibv_query_qp()
- ibv_modify_qp()

- ibv_query_port()

- ibv_attach_mcast()
- ibv_detach_mcast()

rdmacm API changes:
- rdma_post_ud_send()
- rdma_get_cm_event()
- rdma_ack_cm_event()
 
As a result of this:
- Existing IB binaries would cease working over RoCE.
- Due to added fields in structures, even just recompiling existing 
applications from source would be problematic.
- To make future applications work on both ib and RoCE transparently, you would 
need additional wrappers such as init_ah(), copy_ah(), and ah_is_equal(), and 
never inspect address handle fields directly.

So why introduce differences between RoCE and IB (for the Application 
writers!!) when they *aren't* needed? Using rdmacm won't solve this either (UD 
traffic).
By following the direction set forth by the RoCE spec none of this is required. 
Existing (rdmacm) application binaries do run over RoCE or IB unchanged.

Granted, the RoCE spec approach introduces 2 *implementation* issues that we 
need to tackle:
1. Address resolution, which is a generic function, should not be a 
device-specific call.
In this matter, we already proposed a solution where resolution is done, as 
required, in generic functions in the kernel.
Specifically, we provide L2 information to user-space drivers via create_ah(), 
avoiding the need to add a new ABI call altogether, while the resolution would 
take place in a generic CMA routine.

2. The Kernel currently assumes that create_ah() can execute in atomic context.
One option is to distinguish between the create_ah() calls (in the kernel) that 
are done for iboe, which are very few, and the rest of the calls that are 
ib-only.
There are other approaches to solve this as well.

It seems clear that our goal should be to solve these issues inside the kernel, 
in the cleanest manner as possible, while preserving transparency to the 
applications.

Comments are welcome.

Liran 

 -Original Message-
 From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
 Sent: Wednesday, July 07, 2010 9:00 AM
 To: Liran Liss
 Cc: Roland Dreier; Jason Gunthorpe; Hefty, Sean; Aleksey 
 Senin; linux-rdma; mo...@voltaire.com; aleks...@voltaire.com; 
 yift...@voltaire.com; Tziporet Koren; al...@voltaire.com
 Subject: Re: When IBoE will be merged to upstream?
 
 Liran Liss wrote:
  but keeping ib_create_ah() callable from any context is not 
 a goal by itself.
 
 going with your approach, if your proposed design is 
 accepted, I believe that you probably need to patch all the 
 code-chains that makes calls under the current assumption
 
  I am looking for constructive ideas for supporting iboe without 
  breaking Verbs/CQE/CM syntax.
 
 I don't agree that exposing the Ethernet L2 related 
 information to the caller is breaking something, the 
 converse, it is a required enhancement. 
 
 I think we need to let resolve through the rdma-cm  get to 
 know at the consumer level, what are the source / destination 
 macs, vlan id and vlan priority used by an IBoE QP, in the 
 exact manner all the IB equivalents (src/dst lid, pkey, sl) 
 are resolved by the rdma-cm and exposed to the consmer app for IB QP.
 
 Or.
 
 --
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-06-25 Thread Liran Liss
VLANs are part of L2 in Ethernet -- when you resolve a destination L3 address 
to an L2 address, you get the outgoing interface, which also determines the 
VLAN.
I think this approach has an advantage over an RDMA device per VLAN in that you 
keep the standard OS VLAN management (vconfig).

I wouldn't judge the RoCE spec so quickly --- it guarantees that rdma 
application binaries could run on any network.
What do you gain by exposing Eth-specific L2 params in the address handle?
--Liran


-Original Message-
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] 
Sent: Thursday, June 24, 2010 11:37 PM
To: Liran Liss
Cc: Hefty, Sean; Roland Dreier; Aleksey Senin; linux-rdma; mo...@voltaire.com; 
aleks...@voltaire.com; yift...@voltaire.com; Tziporet Koren; al...@voltaire.com
Subject: Re: When IBoE will be merged to upstream?

 The current behavior of ibv_create_ah() requires that the caller 
 provide the L2, and if needed, L3 addressing.  Any translation between 
 the L3 and L2 addressing must be done before the call is made.  E.g. 
 ibv_create_ah does not use the GID to query the SA to obtain LIDs.  
 Why doesn't IBoE follow this same model?
 
 LL: because of the RoCE spec, which states that only GID addressing is 
 used at the Verbs level. The address handle fields are unchanged, and 
 the L2 fields (e.g., lid) are reserved.  Note that in Ethernet, you 
 normally don't specify L2 addresses at the transport level (i.e., 
 sockets).

We do not have to lavishly follow the IBTA spec in the Linux implementation, 
especially if it makes no sense.

I think Sean is on the right track here, the AH should take the L2 as input 
just like for IB, and the resolution is done in librdmacm, or somehow manually.

The verbs layer is not really analogous to sockets anyhow, the librdmacm is 
much closer to a socket like interface, and it having a GID go into rdmacm and 
a full AH with L2/L3 info come out seems entirely reasonable.

BTW, what ever was decided about vlans tagging? Is that part of the AH or do 
you use seperate RDMA devices per vlan? Seems like a point worth considering 
now.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-06-24 Thread Liran Liss
Regarding GID to Eth mappings, we discussed using the standard create_ah() Verb 
for this.
In the kernel, create_ah() will call a generic address resolution function in 
the cma.
The returned information will be copied back to user-space in a device-specific 
structure (since address handles are device-specific).

This eliminates adding a new get_eth_l2_addr() ABI call (device-specific or 
not).
In fact, this approach eliminates adding new ABIs for any kind of address 
translation...

Does this seem reasonable?
--Liran


-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier
Sent: Thursday, June 24, 2010 2:24 AM
To: Aleksey Senin
Cc: linux-rdma; mo...@voltaire.com; aleks...@voltaire.com; 
yift...@voltaire.com; Tziporet Koren; al...@voltaire.com
Subject: Re: When IBoE will be merged to upstream?

  This is actually a continue of the RAW_ET()  issue. We want to   make a 
  submition  of the patches to the upstream, but there is not   support for 
  IB transport in Ethernet devices, and the mlx4_en drivers   version  is a 
  bit outdated 1.4.1.1 in upstream and 1.5.1 in the OFED   There is also 
  missing VLAN support that already present in the OFED.
  When do you planning to submit changes from OFED to upstream?

 - I do not search for more things to merge upstream.  I have enough
   work reviewing things that are sent to me.  So I will never look
   through OFED for changes.

 - I do not handle the mlx4_en driver.  Changes for mlx4_en should go to
   netdev and Dave Miller.

 - I will try to get back to the IBoE changes when I have time, and I
   will admit that my time to spend as RDMA maintainer is nowhere near
   full time and less than it was in the past.

 - I did allocate a fair amount of time to spend on IBoE recently but
   unfortunately the patches were not really in a suitable state to
   merge, and I exhausted that time slice before we reached the end.
   When patch sets sit outside of the upstream kernel and are shipped in
   OFED for months and years, it would probably make upstream merging
   easier if that time was used to fix the patch set.

 - Specifically for the IBoE patches, shouldn't someone have realized
   that having a device-specific interface to do the standard mapping of
   GID to Ethernet address makes no sense?
--
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the 
body of a message to majord...@vger.kernel.org More majordomo info at  
http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-06-24 Thread Liran Liss
S.B.
--Liran 

-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Thursday, June 24, 2010 9:06 PM
To: Liran Liss; Roland Dreier; Aleksey Senin
Cc: linux-rdma; mo...@voltaire.com; aleks...@voltaire.com; 
yift...@voltaire.com; Tziporet Koren; al...@voltaire.com
Subject: RE: When IBoE will be merged to upstream?

 Regarding GID to Eth mappings, we discussed using the standard 
 create_ah() Verb for this.
 In the kernel, create_ah() will call a generic address resolution 
 function in the cma.
 The returned information will be copied back to user-space in a 
 device- specific structure (since address handles are device-specific).
 
 This eliminates adding a new get_eth_l2_addr() ABI call 
 (device-specific or not).
 In fact, this approach eliminates adding new ABIs for any kind of 
 address translation...
 
 Does this seem reasonable?

The current behavior of ibv_create_ah() requires that the caller provide the 
L2, and if needed, L3 addressing.  Any translation between the L3 and L2 
addressing must be done before the call is made.  E.g. ibv_create_ah does not 
use the GID to query the SA to obtain LIDs.  Why doesn't IBoE follow this same 
model?

LL: because of the RoCE spec, which states that only GID addressing is used at 
the Verbs level. The address handle fields are unchanged, and the L2 fields 
(e.g., lid) are reserved.
Note that in Ethernet, you normally don't specify L2 addresses at the transport 
level (i.e., sockets).

Callers can use some out of band mechanism for the mapping, call 
rdma_resolve_addr, or use some standard networking routine.
LL: this would require changes to the Verbs API. rdmacm programs addresses 
using user-space Verbs, or even just passes the application just address handle 
attributes...
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace

2010-05-17 Thread Liran Liss
If we have a dedicated ABI call for this mapping, then it seems reasonable to 
have it device independent.
However, this mapping is really only used when creating address handles.

So, we can base the mapping on the (device specific) create_ah() flow, but 
provide generic mapping functions for all devices to use (this is kind of what 
happens now).
Also, using create_ah() doesn't introduce an ABI call that is specific to 
ib--eth mappings.

This is similar to how device-specific ib_reg_user_mr() functions call the 
generic ib_umem_get()...

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier
Sent: Thursday, May 13, 2010 10:18 PM
To: Sean Hefty
Cc: 'Eli Cohen'; Eli Cohen; Linux RDMA list; ewg
Subject: Re: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace

  Basically, what I want to understand is why does this change make sense?
 
  @@ -1139,6 +1139,10 @@ struct ib_device {
 struct ib_grh *in_grh,
 struct ib_mad *in_mad,
 struct ib_mad *out_mad);
  +int(*get_eth_l2_addr)(struct ib_device *device,
  u8 port,
  +  union ib_gid *dgid, int
  sgid_idx,
  +  u8 *mac, u16 *vlan_id, u8
  *tagged);
  +

Yes, that was pretty much my original question.  Why do we have a verb for 
userspace to call a device-specific method to do the mapping?  The layering 
seems wrong somewhere if we have a generic verb to do this mapping, but then 
put the mapping in device-specific code.

 - R.
--
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the 
body of a message to majord...@vger.kernel.org More majordomo info at  
http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: RDMAoE verbs questions

2009-12-03 Thread Liran Liss
 Existing apps rely on transport_type == IBV_TRANSPORT_IB to indicate
IB management is present. There are many examples of this.
 The art of API compatability is to not break existing old apps, so you
don't get to change the meaning of transport_type == IBV_TRANSPORT_IB to
mean 'it is only IB verbs like'. That breaks the API.
 Adding a new field to port_attr preserves functionality but not
compatability. I hope you understand the difference.

I understand exactly what you mean, but I want to propose another way of
looking at the compatibility issue:

IB management is a network service. Just as an administrator might
mistakenly try to access an FTP server over a wrong eth interface, an IB
admin can mistakenly run an IB management application (or a non-rdmacm
app that is incompatible with RDMAoE) on a RDMAoE port. In both cases,
the service is unreachable, but otherwise no harm is done.
Basically, IB management is supported by RDMAoE --- you use the same
verbs to access it but you don't get any answers...
(Of course that it terms of implementation, we can choose to drop SMPs
or non-CM MADs rather then sending them on the wire.)

There is a tradeoff here between transparently supporting existing
rdmacm apps versus making sure that non-rdmacm apps that may come across
an RDMAoE port do not attempt to use it. 
Given the fact that rdmacm has become a preferred approach, and that for
non-rdmacm apps the worst case effect is that of being unable to create
a connection through a port that they were not designed to support to
begin with, we prefer the approach that we proposed.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: RDMAoE verbs questions

2009-12-02 Thread Liran Liss

 So? There are substantial semantic differences for *all* non-rdmacm
applications. Even common ones like OpenMPI. You propose to ignore them?

On the contrary! Any application that *does* care what the link layer is
can look up a new field in port_attr (rather than a new node transport
type).
Applications that don't, both old and new, can continue as normal - no
changes to the code are required.

 RDMAoE *is* IB transport over Ethernet - we don't want different 
 devices with different node types exactly for this reason: 
 applications shouldn't care if they are running over IB or RDMAoE,
and 
 shouldn't add another switch statement to support RDMAoE.

 Nonsense. RDMAoE is no such thing, it is utterly incompatible with the
IB management model. It is some new protocol that is only about 90%
compatible with IB.

You are missing the point here - RDMAoE is 100% compatible with IB at
the *transport* level, as reflected by the Verbs.
The point that the management model is different is true, but
irrelevant. The only transport-related issue that matters is addressing,
but for user-space apps, it is completely abstracted by the rdmacm.

Non-rdmacm apps fall into 2 main categories:
1. IB management+diagnostics apps - these are irrelevant for an Eth
network anyway.
2. High-performance middleware such as MPI and SHMEM - these perform
optimizations according to the link protocol anyway.

So, all relevant apps will work great with either IB or RDMAoE, in a
transparent manner.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: RDMAoE verbs questions

2009-12-02 Thread Liran Liss
Hi Paul, you are not missing anything - lookback communication will work
in RDMAoE just as in IB.
--Liran

-Original Message-
From: Paul Grun [mailto:pg...@systemfabricworks.com] 
Sent: Wednesday, December 02, 2009 10:55 AM
To: 'Or Gerlitz'; Liran Liss
Cc: 'Sean Hefty'; 'Jason Gunthorpe'; 'Eli Cohen'; 'Jeff Squyres';
linux-rdma@vger.kernel.org
Subject: RE: RDMAoE verbs questions

Why do you say that Or?  
I'm a hardware guy so can't comment on what the s/w supports/prevents,
but I can see no reason why loopback wouldn't be supported.  In fact, if
I were guessing, I would guess that the spec currently under development
would support loopback.tt What am I missing?
-Paul

-Original Message-
From: linux-rdma-ow...@vger.kernel.org
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Or Gerlitz
Sent: Wednesday, December 02, 2009 12:09 AM
To: Liran Liss
Cc: Sean Hefty; Jason Gunthorpe; Eli Cohen; Jeff Squyres;
linux-rdma@vger.kernel.org
Subject: Re: RDMAoE verbs questions

Liran Liss wrote:
 from an rdmacm app's point of view - there is no visible difference
between IB and RDMAoE ports: both support the complete set of Verbs,
just as any IB transport provider
   
wrong,  local (loopback) communication aren't supported  with RDMAoE.

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org More majordomo info
at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html