Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-19 Thread Hal Rosenstock
On 6/18/2015 5:00 PM, Doug Ledford wrote:
> There is *zero* functional difference between node_type == OPA or node_type 
> == IB_CA and link_layer == OPA. 
> An application has *exactly* what they need

We have neither of these things in the kernel today. Also, if I interpreted 
what was written by first Ira and more recently Sean, even if either of these 
things were done, the user space provider library for OPA might/could just 
change these back to the IB types.

-- Hal
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in


Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-18 Thread Doug Ledford

> On Jun 16, 2015, at 5:05 PM, Liran Liss wrote:
> 
>> From: Doug Ledford [mailto:dledf...@redhat.com]
> 
>>> No. RoCE is an open standard from the IBTA with the exact same RDMA
>>> protocol semantics as InfiniBand and a clear set of compliancy rules without
>>> which an implementation can't claim to be such. A RoCE device *is* an IB CA
>>> with an Ethernet link.
>>> In contrast, OPA is a proprietary protocol. We don't know what primitives
>>> are supported, and whether the semantics of supported primitives are the
>>> same as in InfiniBand.
>> 
>> Intel has stated on this list that they intend for RDMA apps to run on
>> OPA transparently.  That pretty much implies the list of primitives and
>> everything else that they must support.  However, time will tell if they
>> succeeded or not.
>> 
> 
> I am sorry, but that's not good enough.
> When I see an IB device, I know exactly what to expect. I can't say anything 
> regarding an OPA device.
> 
> It might be that today the semantics are "close enough".
> But in the future, both feature sets and semantics may diverge considerably.
> What are you going to do then?
> 
> In addition, today, the host admin knows that 2 IB CA nodes will always 
> interoperate. If you share the node type with OPA, everything breaks down. 
> There is no way of knowing which devices work with which.

You’ve not done yourself any favors with this argument.  You’ve actually 
stretched yourself into the land of hyperbole and FUD in order to make this.  
Do you not see that “2 IB CA nodes will always interoperate” is not true as 
soon as you consider differing link layer types?  For example, an mlx4_en 
device will not interoperate with a qib device, yet they are both IB_CA node 
types.  Conflating allowing an OPA device to be node type IB_CA and link layer 
OPA with everything breaking down is pure and utter rubbish.  And with that, we 
are done with this discussion.  I’ve detailed what my litmus test will be, and 
I’m sticking with exactly that.

In the case of iWARP and usNIC, there are significant differences from an IB_CA 
that render a program responsible for possibly altering its intended transfer 
mechanism significantly (for instance usNIC is UD only, iWARP can’t do atomics 
or immediate data, so any transfer engine design that uses either of those is 
out of the question).  On the other hand, everything that uses IB_CA supports 
the various primitives and only vary in their addressing/management.  If OPA 
stays true to that (and it certainly does so far by supporting the same verbs 
as qib), then IB_CA/link_layer OPA is perfectly acceptable and in fact 
preferred due to the fact that it will produce the minimum amount of change in 
user space applications before they can support the OPA devices.

>> So this will be my litmus test.  Currently, an app that supports all of
>> the RDMA types looks like this:
>> 
>> if (node_type == RNIC)
>>  do iwarpy stuff
>> else if (node_type == USNIC)
>>  do USNIC stuff
>> else if (node_type == IB_CA)
>>  do IB verbs stuff
>>  if (link_layer == Ethernet)
>>  do RoCE addressing/management
>>  else
>>  do IB addressing/management
>> 
>> 
>> 
>> If, in the end, apps that are modified to support OPA end up looking
>> like this:
>> 
>> if (node_type == RNIC)
>>  do iwarpy stuff
>> else if (node_type == USNIC)
>>  do USNIC stuff
>> else if (node_type == IB_CA || node_type == OPA_CA)
>>  do IB verbs stuff
>>  if (node_type == OPA_CA)
>>  do OPA addressing/management
>>  else if (link_layer == Ethernet)
>>  do RoCE addressing/management
>>  else
>>  do IB addressing/management
>> 
>> where you can plainly see that the exact same goal can be accomplished
>> whether you have an OPA node_type or an IB_CA node_type + OPA
>> link_layer, then I will be fine with either a new node_type or a new
>> link_layer.  They will be functionally equivalent as far as I'm concerned.
>> 
> 
> It is true that for some applications, your abstraction might work 
> transparently.
> But for other applications, your "do IB verbs stuff" (and not just the 
> addressing/management) will either break today or break tomorrow.

FUD.  Come to me when you have a concrete issue and not hand-wavy scare 
mongering.

> This is bad both for IB and for OPA.

No, it’s not.

> Why on earth are we putting ourselves into a position which could easily be 
> avoided in the first place?
> 
> The solution is simple:
> - As an API, Verbs will support IB/ROCE, iWARP, USNIC, and OPA

There is *zero* functional difference between node_type == OPA or node_type == 
IB_CA and link_layer == OPA. An application has *exactly* what they need to do 
everything you have mentioned.  It changes the test the application makes, but 
not what the application does.
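
To make the equivalence concrete, here is a minimal C sketch of the two tests from the litmus test quoted above. Every name in it is a hypothetical stand-in chosen for illustration (these are not the kernel's RDMA_NODE_* or libibverbs' link-layer constants, and neither an OPA node type nor an OPA link layer is assumed to exist upstream). It only shows that both encodings select the same addressing path.

```c
/* Hypothetical stand-ins for illustration; not the real kernel/verbs values. */
enum node_type  { NODE_IB_CA, NODE_RNIC, NODE_USNIC, NODE_OPA_CA };
enum link_layer { LINK_IB, LINK_ETHERNET, LINK_OPA };
enum addressing { ADDR_NONE, ADDR_IB, ADDR_ROCE, ADDR_OPA };

/* Variant A: OPA is announced through a new node_type. */
enum addressing addressing_with_opa_node_type(enum node_type nt,
                                              enum link_layer ll)
{
    if (nt != NODE_IB_CA && nt != NODE_OPA_CA)
        return ADDR_NONE;              /* iWARP/usNIC take their own path */
    if (nt == NODE_OPA_CA)
        return ADDR_OPA;
    return ll == LINK_ETHERNET ? ADDR_ROCE : ADDR_IB;
}

/* Variant B: OPA is an IB_CA with a new link_layer. */
enum addressing addressing_with_opa_link_layer(enum node_type nt,
                                               enum link_layer ll)
{
    if (nt != NODE_IB_CA)
        return ADDR_NONE;              /* iWARP/usNIC take their own path */
    switch (ll) {
    case LINK_OPA:      return ADDR_OPA;
    case LINK_ETHERNET: return ADDR_ROCE;
    default:            return ADDR_IB;
    }
}
```

In both variants the "do IB verbs stuff" branch is shared; only the condition that selects OPA addressing/management differs.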

> - The node type and link type refer to specific technologies

Yes, and Intel has made it clear that they are copying IB Verbs as a 
technology.  It is c

RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-18 Thread Liran Liss
> From: Weiny, Ira [mailto:ira.we...@intel.com]

> > ib_verbs define an *extensive* direct HW access API, which is constantly
> > evolving.
> 
> This is the problem with verbs...

Huh?
It is its strength, if you don't break backward compatibility...

> 
> > You cannot describe the intricate object relations and semantics through an
> > API.
> > In addition, you can't abstract anything or fix stuff in SW.
> > The only way to *truly* know what to expect when performing Verbs calls
> > is to check the node type.
> 
> How can you say this?
> 
> mthca, mlx4, mlx5, and qib all have different sets of functionality... all
> with the same node type.  OPA has the same set as qib...  same node type.
> 

Only that qib is IB, which is fully interoperable with mlx*

> >
> > ib_verbs was never only an API. It started as the Linux implementation of
> > the IBTA standard, with guaranteed semantics and wire protocol.
> > Later, the interface was reused to support additional RDMA devices.
> > However, you could *always* check the node type if you wanted to, thereby
> > retaining the standard guarantees. Win-win situation...
> 
> Not true at all.  For example, Qib does not support XRC and yet has the same
> node type as mlx4 (5)...

The node type is for guaranteeing semantics and interop for the features that 
you do implement...

> 
> >
> > This is a very strong property; we should not give up on it.
> 
> On the contrary the property is weak and implies functionality or lack of
> functionality rather than being explicit.  This was done because getting
> changes to kernel ABIs was hard and we took a shortcut with node type
> which we should not have.  OPA attempts to stop this madness and supports
> the functionality of verbs _As_ _Defined_ rather than creating yet another
> set of things which applications need to check against.
> 

I totally agree that we should improve the expressiveness and accuracy of our 
capabilities;
you don't need OPA for this. Unfortunately, it is not always the case. 

Also, there are behaviors that are not defined by the API, but still rely on 
the node type.
Management applications, for example.

--Liran


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-17 Thread Weiny, Ira
> 
> ib_verbs define an *extensive* direct HW access API, which is constantly
> evolving.

This is the problem with verbs...

> You cannot describe the intricate object relations and semantics through an
> API.
> In addition, you can't abstract anything or fix stuff in SW.
> The only way to *truly* know what to expect when performing Verbs calls is to
> check the node type.

How can you say this?

mthca, mlx4, mlx5, and qib all have different sets of functionality... all with 
the same node type.  OPA has the same set as qib...  same node type.

> 
> ib_verbs was never only an API. It started as the Linux implementation of the
> IBTA standard, with guaranteed semantics and wire protocol.
> Later, the interface was reused to support additional RDMA devices. However,
> you could *always* check the node type if you wanted to, thereby retaining the
> standard guarantees. Win-win situation...

Not true at all.  For example, Qib does not support XRC and yet has the same 
node type as mlx4 (5)...

> 
> This is a very strong property; we should not give up on it.

On the contrary the property is weak and implies functionality or lack of 
functionality rather than being explicit.  This was done because getting 
changes to kernel ABIs was hard and we took a shortcut with node type which we 
should not have.  OPA attempts to stop this madness and supports the 
functionality of verbs _As_ _Defined_ rather than creating yet another set of 
things which applications need to check against.
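
As a rough illustration of "explicit" versus "implied" functionality, the C sketch below tests a capability bit instead of inferring support from the node type. The struct, the flag names and the two example devices are hypothetical; a real application would obtain device attributes from the verbs library (for example via `ibv_query_device()`) rather than from these stand-ins.

```c
#include <stdbool.h>

/* Hypothetical capability bits, for illustration only. */
#define CAP_XRC     (1u << 0)
#define CAP_ATOMICS (1u << 1)

/* Hypothetical, trimmed-down view of a device's attributes. */
struct dev_attr {
    int          node_type;   /* same value for both devices below */
    unsigned int cap_flags;   /* explicit per-device capabilities   */
};

/* Ask the device what it supports instead of guessing from node_type. */
bool dev_supports(const struct dev_attr *attr, unsigned int cap)
{
    return (attr->cap_flags & cap) == cap;
}

/* Two devices sharing a node type but differing in XRC support,
 * mirroring the qib vs. mlx4 example above (values are made up). */
const struct dev_attr qib_like  = { 1, CAP_ATOMICS };
const struct dev_attr mlx4_like = { 1, CAP_ATOMICS | CAP_XRC };
```

With this shape, `dev_supports(&qib_like, CAP_XRC)` is false and `dev_supports(&mlx4_like, CAP_XRC)` is true even though the node types are identical.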

> 
> >
> > You're right that apps can be coded to other CA types, like RNICs and
> > USNICs.  However, those are all very different from an IB_CA due to
> > limited queue pair types or limited primitives.  If OPA had that same
> > limitation then I would agree it needs a different node type.
> >
> 
> How do you know that it doesn't?

Up until now you have had to take my word for it.  Now that the driver has been 
posted it should be clear what verbs we support (same as qib).

Ira



RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-16 Thread Hefty, Sean
> You're right that apps can be coded to other CA types, like RNICs and
> USNICs.  However, those are all very different from an IB_CA due to
> limited queue pair types or limited primitives.  If OPA had that same
> limitation then I would agree it needs a different node type.
> 
> So this will be my litmus test.  Currently, an app that supports all of
> the RDMA types looks like this:
> 
> if (node_type == RNIC)
>   do iwarpy stuff
> else if (node_type == USNIC)
>   do USNIC stuff
> else if (node_type == IB_CA)
>   do IB verbs stuff
>   if (link_layer == Ethernet)
>   do RoCE addressing/management
>   else
>   do IB addressing/management

The node type values were originally defined to align with the IB management 
NodeInfo structure.  AFAICT, there was no intent to associate those values with 
specific functionality or addressing or verbs support or anything else, really, 
outside of what IB management needed.

iWarp added a new node type, so that the IB management code could ignore those 
devices.  RoCE basically broke this association by forcing additional checks in 
the local management code to also check against the link layer.  The recent mad 
capability bits are a superior solution, making the node type obsolete. 

At this point, the node type essentially indicates if we start counting ports 
at a numeric value 0 or 1.  The NodeType that an OPA channel adapter will 
report in a NodeInfo structure will be 1, the same value as if it were an IB 
channel adapter.
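
The port-numbering point can be sketched in a few lines of C. This is a simplified stand-in, loosely modeled on the switch check the kernel code performs, and the type and function names are assumptions made for this example. The numeric values 1 (channel adapter) and 2 (switch) follow the IB NodeInfo NodeType encoding mentioned above.

```c
/* NodeInfo-style node types; names are local to this sketch. */
enum sketch_node_type {
    SKETCH_NODE_CA     = 1,
    SKETCH_NODE_SWITCH = 2
};

struct port_range {
    unsigned int first;
    unsigned int last;
};

/*
 * A switch is addressed through its management port 0 only;
 * a channel adapter exposes ports 1..phys_port_cnt.
 */
struct port_range port_range_for(enum sketch_node_type nt,
                                 unsigned int phys_port_cnt)
{
    struct port_range r;

    if (nt == SKETCH_NODE_SWITCH) {
        r.first = 0;
        r.last  = 0;
    } else {
        r.first = 1;
        r.last  = phys_port_cnt;
    }
    return r;
}
```

Under this reading, an OPA channel adapter reporting NodeType 1 would get the same 1-based range as an IB channel adapter.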

In the end, this argument matters not one iota.  The kernel code barely relies on 
the node type, and a user space verbs provider can report whatever value it 
wants.

- Sean


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-16 Thread Liran Liss
> From: Doug Ledford [mailto:dledf...@redhat.com]

> > No. RoCE is an open standard from the IBTA with the exact same RDMA
> > protocol semantics as InfiniBand and a clear set of compliancy rules without
> > which an implementation can't claim to be such. A RoCE device *is* an IB CA
> > with an Ethernet link.
> > In contrast, OPA is a proprietary protocol. We don't know what primitives
> > are supported, and whether the semantics of supported primitives are the
> > same as in InfiniBand.
> 
> Intel has stated on this list that they intend for RDMA apps to run on
> OPA transparently.  That pretty much implies the list of primitives and
> everything else that they must support.  However, time will tell if they
> succeeded or not.
> 

I am sorry, but that's not good enough.
When I see an IB device, I know exactly what to expect. I can't say anything 
regarding an OPA device.

It might be that today the semantics are "close enough".
But in the future, both feature sets and semantics may diverge considerably.
What are you going to do then?

In addition, today, the host admin knows that 2 IB CA nodes will always 
interoperate. If you share the node type with OPA, everything breaks down. 
There is no way of knowing which devices work with which.

> >> The new OPA stuff appears to be following *exactly* the same development
> >> model/path that RoCE did.  When RoCE was introduced, all the apps that
> >> really cared about low level addressing on the link layer had to be
> >> modified to encompass the new link type.  This is simply link_layer
> >> number three for apps to care about.
> >>
> >
> > You are missing my point. API transparency is not a synonym for full
> > semantic equivalence.  The Node Type doesn’t indicate level of adherence to
> > an API. Node Type indicates compliancy to a specification (e.g. wire
> > protocol, remote order of execution, error semantics, architectural
> > limitations, etc). The IBTA CA and Switch Node Types belong to devices that
> > are compliant to the corresponding specifications from the InfiniBand Trade
> > Association.  And that doesn’t prevent applications from choosing to be
> > coded to run over nodes of different Node Type as it happens today with
> > IB/RoCE and iWARP.
> >
> > This has nothing to do with addressing.
> 
> And whether you like it or not, Intel is intentionally creating a
> device/fabric with the specific intention of mimicking the IB_CA device
> type (with stated exceptions for MAD packets and addresses).  They
> obviously won't have certification as an IB_CA, but that's not their
> aim.  Their aim is to be a functional drop in replacement that apps
> don't need to know about except for the stated exceptions.
> 

Intentions are nice, but there is no way to define these "stated exceptions" 
apart from a specification.

> And I'm not missing your point.  Your point is inappropriate.  You're
> trying to conflate certification with a functional API.  The IB_CA node
> type is not an official certification of anything, and the linux kernel
> is not an official certifying body for anything.  If you want
> certification, you go to the OFA and the UNH-IOL testing program.
> There, you have the rights to the certification branding logo and you
> have the right to deny access to that logo to anyone that doesn't meet
> the branding requirements.

Who said anything about certification?
I am talking about present and future semantic compliance to what an IB CA 
stands for, and interoperability guarantees.

ib_verbs define an *extensive* direct HW access API, which is constantly 
evolving.
You cannot describe the intricate object relations and semantics through an API.
In addition, you can't abstract anything or fix stuff in SW.
The only way to *truly* know what to expect when performing Verbs calls is to 
check the node type.

ib_verbs was never only an API. It started as the Linux implementation of the 
IBTA standard, with guaranteed semantics and wire protocol.
Later, the interface was reused to support additional RDMA devices. However, 
you could *always* check the node type if you wanted to, thereby retaining the 
standard guarantees. Win-win situation...

This is a very strong property; we should not give up on it.

> 
> You're right that apps can be coded to other CA types, like RNICs and
> USNICs.  However, those are all very different from an IB_CA due to
> limited queue pair types or limited primitives.  If OPA had that same
> limitation then I would agree it needs a different node type.
> 

How do you know that it doesn't?
Have you seen the OPA specification?

> So this will be my litmus test.  Currently, an app that supports all of
> the RDMA types looks like this:
> 
> if (node_type == RNIC)
>   do iwarpy stuff
> else if (node_type == USNIC)
>   do USNIC stuff
> else if (node_type == IB_CA)
>   do IB verbs stuff
>   if (link_layer == Ethernet)
>   do RoCE addressing/management
>   else
>   do IB addressing/management
> 
> 
> 
> If, in th

Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-14 Thread Doug Ledford
On 06/14/2015 03:16 PM, Liran Liss wrote:
>> From: Doug Ledford [mailto:dledf...@redhat.com]
> 
>>> But the node_type stands for more than just an abstract RDMA device:
>>> In IB, it designates an instance of an industry-standard, well-defined,
>>> device type: its possible link types, transport, semantics, management,
>>> everything.
>>> It *should* be exposed to user-space so apps that know and care what
>>> they are running on could continue to work.
>>
>> I'm sorry, but your argument here is not very convincing at all.  And
>> it's somewhat hypocritical.  When RoCE was first introduced, the *exact*
>> same argument could be used to argue for why RoCE should require a new
>> node_type.  Except then, because RoCE was your own, you argued for, and
>> got, an expansion of the IB node_type definition that now included a
>> relevant link_layer attribute that apps never needed to care about
>> before.  However, now you are a victim of your own success.  You set the
>> standard then that if the new device can properly emulate an IB Verbs/IB
>> Link Layer device in terms of A) supported primitives (iWARP and usNIC
>> both fail here, and hence why they have their own node_types) and B)
>> queue pair creation process modulo link layer specific addressing
>> attributes, then that device qualifies to use the IB_CA node_type and
>> merely needs only a link_layer attribute to differentiate it.
>>
> 
> No. RoCE is an open standard from the IBTA with the exact same RDMA 
> protocol semantics as InfiniBand and a clear set of compliancy rules without 
> which an implementation can't claim to be such. A RoCE device *is* an IB CA 
> with an Ethernet link.
> In contrast, OPA is a proprietary protocol. We don't know what primitives are 
> supported, and whether the semantics of supported primitives are the same as 
> in InfiniBand.

Intel has stated on this list that they intend for RDMA apps to run on
OPA transparently.  That pretty much implies the list of primitives and
everything else that they must support.  However, time will tell if they
succeeded or not.

>> The new OPA stuff appears to be following *exactly* the same development
>> model/path that RoCE did.  When RoCE was introduced, all the apps that
>> really cared about low level addressing on the link layer had to be
>> modified to encompass the new link type.  This is simply link_layer
>> number three for apps to care about.
>>
> 
> You are missing my point. API transparency is not a synonym for full semantic 
> equivalence.  The Node Type doesn’t indicate level of adherence to an API. 
> Node Type indicates compliancy to a  specification (e.g. wire protocol, 
> remote order of execution, error semantics, architectural limitations, etc). 
> The IBTA CA and Switch Node Types belong to devices that are compliant to the 
> corresponding specifications from the InfiniBand Trade Association.  And that 
> doesn’t prevent applications from choosing to be coded to run over nodes of 
> different Node Type as it happens today with IB/RoCE and iWARP.
> 
> This has nothing to do with addressing.

And whether you like it or not, Intel is intentionally creating a
device/fabric with the specific intention of mimicking the IB_CA device
type (with stated exceptions for MAD packets and addresses).  They
obviously won't have certification as an IB_CA, but that's not their
aim.  Their aim is to be a functional drop in replacement that apps
don't need to know about except for the stated exceptions.

And I'm not missing your point.  Your point is inappropriate.  You're
trying to conflate certification with a functional API.  The IB_CA node
type is not an official certification of anything, and the linux kernel
is not an official certifying body for anything.  If you want
certification, you go to the OFA and the UNH-IOL testing program.
There, you have the rights to the certification branding logo and you
have the right to deny access to that logo to anyone that doesn't meet
the branding requirements.

You're right that apps can be coded to other CA types, like RNICs and
USNICs.  However, those are all very different from an IB_CA due to
limited queue pair types or limited primitives.  If OPA had that same
limitation then I would agree it needs a different node type.

So this will be my litmus test.  Currently, an app that supports all of
the RDMA types looks like this:

if (node_type == RNIC)
do iwarpy stuff
else if (node_type == USNIC)
do USNIC stuff
else if (node_type == IB_CA)
do IB verbs stuff
if (link_layer == Ethernet)
do RoCE addressing/management
else
do IB addressing/management



If, in the end, apps that are modified to support OPA end up looking
like this:

if (node_type == RNIC)
do iwarpy stuff
else if (node_type == USNIC)
do USNIC stuff
else if (node_type == IB_CA || node_type == OPA_CA)
do IB verbs stuff
if (node_type == OPA_CA)
do OPA addressing/manage

RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-14 Thread Liran Liss
> From: Doug Ledford [mailto:dledf...@redhat.com]

> > But the node_type stands for more than just an abstract RDMA device:
> > In IB, it designates an instance of an industry-standard, well-defined,
> > device type: its possible link types, transport, semantics, management,
> > everything.
> > It *should* be exposed to user-space so apps that know and care what
> > they are running on could continue to work.
> 
> I'm sorry, but your argument here is not very convincing at all.  And
> it's somewhat hypocritical.  When RoCE was first introduced, the *exact*
> same argument could be used to argue for why RoCE should require a new
> node_type.  Except then, because RoCE was your own, you argued for, and
> got, an expansion of the IB node_type definition that now included a
> relevant link_layer attribute that apps never needed to care about
> before.  However, now you are a victim of your own success.  You set the
> standard then that if the new device can properly emulate an IB Verbs/IB
> Link Layer device in terms of A) supported primitives (iWARP and usNIC
> both fail here, and hence why they have their own node_types) and B)
> queue pair creation process modulo link layer specific addressing
> attributes, then that device qualifies to use the IB_CA node_type and
> merely needs only a link_layer attribute to differentiate it.
> 

No. RoCE is an open standard from the IBTA with the exact same RDMA protocol 
semantics as InfiniBand and a clear set of compliancy rules without which an 
implementation can't claim to be such. A RoCE device *is* an IB CA with an 
Ethernet link.
In contrast, OPA is a proprietary protocol. We don't know what primitives are 
supported, and whether the semantics of supported primitives are the same as in 
InfiniBand.

> The new OPA stuff appears to be following *exactly* the same development
> model/path that RoCE did.  When RoCE was introduced, all the apps that
> really cared about low level addressing on the link layer had to be
> modified to encompass the new link type.  This is simply link_layer
> number three for apps to care about.
> 

You are missing my point. API transparency is not a synonym for full semantic 
equivalence.  The Node Type doesn’t indicate level of adherence to an API. Node 
Type indicates compliancy to a  specification (e.g. wire protocol, remote order 
of execution, error semantics, architectural limitations, etc). The IBTA CA and 
Switch Node Types belong to devices that are compliant to the corresponding 
specifications from the InfiniBand Trade Association.  And that doesn’t prevent 
applications from choosing to be coded to run over nodes of different Node Type as 
it happens today with IB/RoCE and iWARP.

This has nothing to do with addressing.



Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-12 Thread Doug Ledford
On 06/11/2015 02:27 PM, Liran Liss wrote:
>> From: Doug Ledford [mailto:dledf...@redhat.com]
> 
> OPA cannot impersonate IB; OPA node and link types have to be
> designated as such.  In terms of MAD processing flows, both
> explicit (as in the handle_opa_smi() call below) and implicit code
> paths (which share IB flows - there are several cases) must make
> this distinction.

 As far as the kernel is concerned, the individual capability bits
 are much more important.  I would actually like to do away with the
 node_type variable from struct ib_device eventually.  As for user
 space,
> 
> We agreed on the concept of capability bits for the sake of simplifying code 
> sharing.
> That is OK.
> 
> But the node_type stands for more than just an abstract RDMA device:
> In IB, it designates an instance of an industry-standard, well-defined, 
> device type: its possible link types, transport, semantics, management, 
> everything.
> It *should* be exposed to user-space so apps that know and care what they are 
> running on could continue to work.

I'm sorry, but your argument here is not very convincing at all.  And
it's somewhat hypocritical.  When RoCE was first introduced, the *exact*
same argument could be used to argue for why RoCE should require a new
node_type.  Except then, because RoCE was your own, you argued for, and
got, an expansion of the IB node_type definition that now included a
relevant link_layer attribute that apps never needed to care about
before.  However, now you are a victim of your own success.  You set the
standard then that if the new device can properly emulate an IB Verbs/IB
Link Layer device in terms of A) supported primitives (iWARP and usNIC
both fail here, and hence why they have their own node_types) and B)
queue pair creation process modulo link layer specific addressing
attributes, then that device qualifies to use the IB_CA node_type and
merely needs only a link_layer attribute to differentiate it.

The new OPA stuff appears to be following *exactly* the same development
model/path that RoCE did.  When RoCE was introduced, all the apps that
really cared about low level addressing on the link layer had to be
modified to encompass the new link type.  This is simply link_layer
number three for apps to care about.

> The place for abstraction is in the rdmacm/CMA, which serves applications 
> that just
> want some RDMA functionality regardless of the underlying technology.
> 
>>>
>>> All SMI code has different behavior if it is running on a switch or
>>> HCA, so testing for 'switchyness' is very appropriate here.
>>
>> Sure...
>>
>>> cap_is_switch_smi would be a nice refinement to let us drop nodetype.
>>
>> Exactly, we need a bit added to the immutable data bits, and a new cap_
>> helper, and then nodetype is ready to be retired.  Add a bit, drop a
>> u8 ;-)
>>
> 
> This is indeed a viable solution.
> 
>>> I don't have a problem with sharing the IBA constant names for MAD
>>> structures (like RDMA_NODE_IB_SWITCH) between IB and OPA code. They
>>> already share the structure layouts/etc.
>>>
> 
> The node type is reflected to user-space, which, as I mentioned above, is 
> important.
> Abusing this enumeration is misleading, even in the kernel.
> Jason's proposal for a 'cap_is_switch_smi' is more readable, and directly in 
> line with
> the explicit capability approach that we discussed.


Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-11 Thread Hal Rosenstock
On 6/11/2015 7:52 PM, Hefty, Sean wrote:
>>> I agree that the node type enum isn't particularly useful and should be
>>> retired.
>>
>> Are you referring to kernel space or user space or both ?
> 
> Short term, kernel space.  User space needs to keep something around for 
> backwards compatibility.
> 
> But the in tree code will never expose this value up.
> 
>>> But even if we do, I'm not sure this is the correct approach.  I don't
>>> know this for a fact, but it seems more likely that someone would embed
>>> Linux on an IB switch than they would plug an IB switch into a Linux based
>>> system.  The code is designed around the latter.  Making this a system wide
>>> setting might simplify the code and optimize the code paths.
>>
>> I think we need to discuss how user space would be addressed.
> 
> This is an issue with out of tree drivers.  We're having to guess what things 
> might be doing.  
> Are all devices being exposed up as a 'switch', or is there ever a case where 
> there's a 'switch' device 
> and an HCA device being reported together, or (highly unlikely) a switch 
> device and an RNIC?

Gateways are comprised of switch + HCA devices. There are other more
complex cases of multiple devices.

> If the real use case is to embed Linux on a switch, then we could look at 
> making that a system wide setting, 
> rather than per device.  This could clean up the kernel without impacting the 
> uABI.

I think that system wide is too limiting and it needs to be on a per
device basis.


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-11 Thread Hefty, Sean
> > I agree that the node type enum isn't particularly useful and should be
> > retired.
> 
> Are you referring to kernel space or user space or both ?

Short term, kernel space.  User space needs to keep something around for 
backwards compatibility.

But the in tree code will never expose this value up.

> > But even if we do, I'm not sure this is the correct approach.  I don't
> > know this for a fact, but it seems more likely that someone would embed
> > Linux on an IB switch than they would plug an IB switch into a Linux based
> > system.  The code is designed around the latter.  Making this a system wide
> > setting might simplify the code and optimize the code paths.
> 
> I think we need to discuss how user space would be addressed.

This is an issue with out of tree drivers.  We're having to guess what things 
might be doing.  Are all devices being exposed up as a 'switch', or is there 
ever a case where there's a 'switch' device and an HCA device being reported 
together, or (highly unlikely) a switch device and an RNIC?

If the real use case is to embed Linux on a switch, then we could look at 
making that a system wide setting, rather than per device.  This could clean up 
the kernel without impacting the uABI.


Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-11 Thread Hal Rosenstock
On 6/11/2015 5:00 PM, Hefty, Sean wrote:
>>> cap_is_switch_smi would be a nice refinement to let us drop nodetype.
>>
>> Exactly, we need a bit added to the immutable data bits, and a new cap_
>> helper, and then nodetype is ready to be retired.  Add a bit, drop a
>> u8 ;-)
> 
> I agree that the node type enum isn't particularly useful and should be 
> retired.  

Are you referring to kernel space or user space or both ?

> In fact, I don't see where RDMA_NODE_IB_SWITCH is used by any upstream 
> device.  

While not upstream, there are at least 2 vendors with one or more switch
device drivers using the upstream stack.

> So I don't think there's any obligation to keep it.  

In kernel space, we can get rid of it but it's exposed by verbs and
currently relied upon in user space in a number of places.

There's one kernel place that needs more than just cap_is_switch_smi().

> But even if we do, I'm not sure this is the correct approach.  I don't know 
> this for a fact, 
> but it seems more likely that someone would embed Linux on an IB switch than 
> they would plug an IB switch 
> into a Linux based system.  The code is designed around the latter.  Making 
> this a system wide setting might simplify the code and optimize the code 
> paths.

I think we need to discuss how user space would be addressed.

-- Hal

> - Sean



RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-11 Thread Hefty, Sean
> > cap_is_switch_smi would be a nice refinement to let us drop nodetype.
> 
> Exactly, we need a bit added to the immutable data bits, and a new cap_
> helper, and then nodetype is ready to be retired.  Add a bit, drop a
> u8 ;-)

I agree that the node type enum isn't particularly useful and should be 
retired.  In fact, I don't see where RDMA_NODE_IB_SWITCH is used by any 
upstream device.  So I don't think there's any obligation to keep it.  But even 
if we do, I'm not sure this is the correct approach.  I don't know this for a 
fact, but it seems more likely that someone would embed Linux on an IB switch 
than they would plug an IB switch into a Linux based system.  The code is 
designed around the latter.  Making this a system wide setting might simplify 
the code and optimize the code paths.

- Sean

RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-11 Thread Liran Liss
> From: Doug Ledford [mailto:dledf...@redhat.com]

> > > > OPA cannot impersonate IB; OPA node and link types have to be
> > > > designated as such.  In terms of MAD processing flows, both
> > > > explicit (as in the handle_opa_smi() call below) and implicit code
> > > > paths (which share IB flows - there are several cases) must make
> > > > this distinction.
> > >
> > > As far as the kernel is concerned, the individual capability bits
> > > are much more important.  I would actually like to do away with the
> > > node_type variable from struct ib_device eventually.  As for user
> > > space,

We agreed on the concept of capability bits for the sake of simplifying code 
sharing.
That is OK.

But the node_type stands for more than just an abstract RDMA device:
In IB, it designates an instance of an industry-standard, well-defined device
type: its possible link types, transport, semantics, management, everything.
It *should* be exposed to user-space so apps that know and care what they are
running on could continue to work.

The place for abstraction is in the rdmacm/CMA, which serves applications that
just want some RDMA functionality regardless of the underlying technology.

> >
> > All SMI code has different behavior if it is running on a switch or
> > HCA, so testing for 'switchyness' is very appropriate here.
> 
> Sure...
> 
> > cap_is_switch_smi would be a nice refinement to let us drop nodetype.
> 
> Exactly, we need a bit added to the immutable data bits, and a new cap_
> helper, and then nodetype is ready to be retired.  Add a bit, drop a
> u8 ;-)
> 

This is indeed a viable solution.

> > I don't have a problem with sharing the IBA constant names for MAD
> > structures (like RDMA_NODE_IB_SWITCH) between IB and OPA code. They
> > already share the structure layouts/etc.
> >

The node type is reflected to user-space, which, as I mentioned above, is
important.
Abusing this enumeration is misleading, even in the kernel.
Jason's proposal for a 'cap_is_switch_smi' is more readable, and directly in
line with the explicit capability approach that we discussed.



Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-10 Thread Doug Ledford
On Wed, 2015-06-10 at 12:56 -0600, Jason Gunthorpe wrote:
> On Wed, Jun 10, 2015 at 02:37:26PM -0400, Doug Ledford wrote:
> > On Wed, 2015-06-10 at 06:30 +, Liran Liss wrote:
> > > > From: Ira Weiny 
> > > 
> > > Hi Ira,
> > > 
> > > OPA cannot impersonate IB; OPA node and link types have to be
> > > designated as such.  In terms of MAD processing flows, both
> > > explicit (as in the handle_opa_smi() call below) and implicit code
> > > paths (which share IB flows - there are several cases) must make
> > > this distinction.
> > 
> > As far as the kernel is concerned, the individual capability bits
> > are much more important.  I would actually like to do away with the
> > node_type variable from struct ib_device eventually.  As for user
> > space,
> 
> All SMI code has different behavior if it is running on a switch or
> HCA, so testing for 'switchyness' is very appropriate here.

Sure...

> cap_is_switch_smi would be a nice refinement to let us drop nodetype.

Exactly, we need a bit added to the immutable data bits, and a new cap_
helper, and then nodetype is ready to be retired.  Add a bit, drop a
u8 ;-)
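
A minimal sketch of the "add a bit, add a cap_ helper" idea, for illustration only. Every name below (the EX_ flags, struct ex_device, ex_cap_is_switch_smi()) is hypothetical and is not taken from the kernel tree; the sketch only assumes that each port carries a word of immutable capability flags that the core can test instead of comparing node_type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical immutable per-port capability bits (illustrative names). */
#define EX_CAP_IB_MAD      (1u << 0)
#define EX_CAP_OPA_MAD     (1u << 1)
#define EX_CAP_SWITCH_SMI  (1u << 2)  /* the proposed "switchyness" bit */

#define EX_MAX_PORTS 2u

/* Hypothetical stand-in for the per-port immutable data. */
struct ex_port_immutable {
	uint32_t core_cap_flags;
};

/* Hypothetical stand-in for the device; note there is no node_type field. */
struct ex_device {
	struct ex_port_immutable port[EX_MAX_PORTS];
};

/* True if SMPs on this port should get switch-style SMI handling. */
static inline bool ex_cap_is_switch_smi(const struct ex_device *dev,
					unsigned int port)
{
	if (port >= EX_MAX_PORTS)
		return false;
	return (dev->port[port].core_cap_flags & EX_CAP_SWITCH_SMI) != 0;
}
```

With something of this shape, a test such as node_type == RDMA_NODE_IB_SWITCH in the SMI path could become a call like ex_cap_is_switch_smi(dev, port), and the u8 would no longer be needed by that code path.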

> I don't have a problem with sharing the IBA constant names for MAD
> structures (like RDMA_NODE_IB_SWITCH) between IB and OPA code. They
> already share the structure layouts/etc.
> 
> Jason


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD





Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-10 Thread Jason Gunthorpe
On Wed, Jun 10, 2015 at 02:37:26PM -0400, Doug Ledford wrote:
> On Wed, 2015-06-10 at 06:30 +, Liran Liss wrote:
> > > From: Ira Weiny 
> > 
> > Hi Ira,
> > 
> > OPA cannot impersonate IB; OPA node and link types have to be
> > designated as such.  In terms of MAD processing flows, both
> > explicit (as in the handle_opa_smi() call below) and implicit code
> > paths (which share IB flows - there are several cases) must make
> > this distinction.
> 
> As far as the kernel is concerned, the individual capability bits
> are much more important.  I would actually like to do away with the
> node_type variable from struct ib_device eventually.  As for user
> space,

All SMI code has different behavior if it is running on a switch or
HCA, so testing for 'switchyness' is very appropriate here.
cap_is_switch_smi would be a nice refinement to let us drop nodetype.

I don't have a problem with sharing the IBA constant names for MAD
structures (like RDMA_NODE_IB_SWITCH) between IB and OPA code. They
already share the structure layouts/etc.

Jason


Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-10 Thread Doug Ledford
On Wed, 2015-06-10 at 06:30 +, Liran Liss wrote:
> > From: Ira Weiny 
> 
> Hi Ira,
> 
> OPA cannot impersonate IB; OPA node and link types have to be designated as 
> such.
> In terms of MAD processing flows, both explicit (as in the handle_opa_smi() 
> call below) and implicit code paths (which share IB flows - there are several 
> cases) must make this distinction.

As far as the kernel is concerned, the individual capability bits are
much more important.  I would actually like to do away with the
node_type variable from struct ib_device eventually.  As for user space,
where we have to maintain ABI, node_type can be IB_CA (after all, the
OPA devices are just like RoCE devices in that they implement IB VERBS
as their user visible transport, and only addressing/management is
different from link layer IB devices), link layer needs to be OPA.
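
A minimal sketch of what this would look like from an application's point of view, assuming node_type stays IB_CA and a distinct link layer value identifies OPA. The enums and the ex_describe() helper are hypothetical and self-contained; they are not the verbs API, and no OPA link layer constant is assumed to exist in any released library.

```c
#include <stddef.h>

/* Hypothetical user-visible values (illustrative, not the verbs ABI). */
enum ex_node_type {
	EX_NODE_IB_CA = 1,
	EX_NODE_IB_SWITCH,
};

enum ex_link_layer {
	EX_LINK_INFINIBAND,
	EX_LINK_ETHERNET,
	EX_LINK_OPA,  /* hypothetical value for an OPA link layer */
};

struct ex_port_view {
	enum ex_node_type node_type;
	enum ex_link_layer link_layer;
};

/* Classify a port by link layer while node_type stays IB_CA for all three. */
static inline const char *ex_describe(const struct ex_port_view *p)
{
	if (p == NULL || p->node_type != EX_NODE_IB_CA)
		return "not a channel adapter";

	switch (p->link_layer) {
	case EX_LINK_INFINIBAND:
		return "IB verbs over an InfiniBand link";
	case EX_LINK_ETHERNET:
		return "IB verbs over an Ethernet link (RoCE)";
	case EX_LINK_OPA:
		return "IB verbs over an OPA link";
	}
	return "unknown link layer";
}
```

An application that only cares about verbs semantics would ignore the link layer entirely; one that cares about addressing or management would branch on it, as it would for RoCE.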

> > +static enum smi_action
> > +handle_opa_smi(struct ib_mad_port_private *port_priv,
> > +  struct ib_mad_qp_info *qp_info,
> > +  struct ib_wc *wc,
> > +  int port_num,
> > +  struct ib_mad_private *recv,
> > +  struct ib_mad_private *response)
> > +{
> ...
> > +   } else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH)  <
> 
> --Liran


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD





Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-10 Thread ira.weiny
On Wed, Jun 10, 2015 at 06:30:58AM +, Liran Liss wrote:
> > From: Ira Weiny 
> 
> Hi Ira,
> 
> OPA cannot impersonate IB; OPA node and link types have to be designated as 
> such.

This was discussed at length and we agreed that the kernel would have explicit
capabilities communicated between the drivers and the core layers rather than
using link layer to determine what core support was needed.

For Node Type, OPA is its own "namespace" and as such we use the same values
for "CA" and "Switch".  The code you reference below is explicitly executed
only on OPA devices so I don't see why this is in conflict with IB.

> In terms of MAD processing flows, both explicit (as in the handle_opa_smi() 
> call below) and implicit code paths (which share IB flows - there are several 
> cases) must make this distinction.
> 

I agree, and all OPA differences are limited to devices/ports which explicitly
indicate they are OPA ports.

For example:

	opa = rdma_cap_opa_mad(qp_info->port_priv->device,
			       qp_info->port_priv->port_num);

	...

	if (opa && ((struct ib_mad_hdr *)(recv->mad))->base_version ==
	    OPA_MGMT_BASE_VERSION) {
		recv->header.recv_wc.mad_len = wc->byte_len - sizeof(struct ib_grh);
		recv->header.recv_wc.mad_seg_size = sizeof(struct opa_mad);
	} else {
		recv->header.recv_wc.mad_len = sizeof(struct ib_mad);
		recv->header.recv_wc.mad_seg_size = sizeof(struct ib_mad);
	}
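
The same length selection can be modelled outside the kernel as a small pure function. This is a sketch under stated assumptions: the EX_ constants stand in for sizeof(struct ib_mad), sizeof(struct opa_mad), sizeof(struct ib_grh) and OPA_MGMT_BASE_VERSION with the sizes I believe they have (256, 2048 and 40 bytes, base version 0x80), and struct ex_recv_len and ex_recv_lengths() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-ins for the kernel sizes and constants (illustrative). */
#define EX_OPA_MGMT_BASE_VERSION 0x80
#define EX_IB_MAD_SIZE   256u
#define EX_OPA_MAD_SIZE  2048u
#define EX_GRH_SIZE      40u

struct ex_recv_len {
	size_t mad_len;       /* bytes of MAD actually received */
	size_t mad_seg_size;  /* size of one MAD segment buffer */
};

/*
 * OPA sizing applies only when the port advertises OPA MAD support AND the
 * received MAD carries the OPA base version; everything else gets IB sizing.
 * The caller must ensure wc_byte_len >= EX_GRH_SIZE on the OPA path.
 */
static inline struct ex_recv_len ex_recv_lengths(bool opa_port,
						 uint8_t base_version,
						 size_t wc_byte_len)
{
	struct ex_recv_len len;

	if (opa_port && base_version == EX_OPA_MGMT_BASE_VERSION) {
		len.mad_len = wc_byte_len - EX_GRH_SIZE;
		len.mad_seg_size = EX_OPA_MAD_SIZE;
	} else {
		len.mad_len = EX_IB_MAD_SIZE;
		len.mad_seg_size = EX_IB_MAD_SIZE;
	}
	return len;
}
```

Note that an IB-versioned MAD arriving on an OPA-capable port still takes the IB branch, which is the point being made: the capability alone does not change behaviour for IB traffic.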


If I missed a place where this is not the case please let me know but I made
this change many months back and I'm pretty sure I caught them all.

Thanks,
Ira


> > +static enum smi_action
> > +handle_opa_smi(struct ib_mad_port_private *port_priv,
> > +  struct ib_mad_qp_info *qp_info,
> > +  struct ib_wc *wc,
> > +  int port_num,
> > +  struct ib_mad_private *recv,
> > +  struct ib_mad_private *response)
> > +{
> ...
> > +   } else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH)  <
> 
> --Liran


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-06-09 Thread Liran Liss
> From: Ira Weiny 

Hi Ira,

OPA cannot impersonate IB; OPA node and link types have to be designated as 
such.
In terms of MAD processing flows, both explicit (as in the handle_opa_smi() 
call below) and implicit code paths (which share IB flows - there are several 
cases) must make this distinction.

> +static enum smi_action
> +handle_opa_smi(struct ib_mad_port_private *port_priv,
> +struct ib_mad_qp_info *qp_info,
> +struct ib_wc *wc,
> +int port_num,
> +struct ib_mad_private *recv,
> +struct ib_mad_private *response)
> +{
...
> + } else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH)  <

--Liran


RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-05-28 Thread Liran Liss
> >
> > Why do you have RDMA_NODE_IB_SWITCH related stuff inside the
> handle_opa_smi() function?
> > Is there a node type of "switch" in OPA similar to IB?
> >
> 
> Yes.  OPA uses the same node types as IB.
> 
> Ira
> 

No, OPA cannot impersonate IB.
It has to have distinct node and link types.

--Liran



Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-05-21 Thread ira.weiny
On Wed, May 20, 2015 at 12:59:01PM -0600, Jason Gunthorpe wrote:
> On Wed, May 20, 2015 at 04:13:35AM -0400, ira.we...@intel.com wrote:
> > @@ -433,14 +436,23 @@ static inline int get_mad_len(struct mad_rmpp_recv *rmpp_recv)
> >  {
> > 	struct ib_rmpp_base *rmpp_base;
> > 	int hdr_size, data_size, pad;
> > +	int opa = rdma_cap_opa_mad(rmpp_recv->agent->qp_info->port_priv->device,
> > +				   rmpp_recv->agent->qp_info->port_priv->port_num);
> 
> bool opa

Thanks Fixed

Ira



RE: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-05-20 Thread Suri Shelvapille
Can you please clarify:

+static enum smi_action
+handle_opa_smi(struct ib_mad_port_private *port_priv,
+	       struct ib_mad_qp_info *qp_info,
+	       struct ib_wc *wc,
+	       int port_num,
+	       struct ib_mad_private *recv,
+	       struct ib_mad_private *response)
+{
+	enum smi_forward_action retsmi;
+
+	if (opa_smi_handle_dr_smp_recv(&recv->mad.opa_smp,
+				       port_priv->device->node_type,
+				       port_num,
+				       port_priv->device->phys_port_cnt) ==
+				       IB_SMI_DISCARD)
+		return IB_SMI_DISCARD;
+
+	retsmi = opa_smi_check_forward_dr_smp(&recv->mad.opa_smp);
+	if (retsmi == IB_SMI_LOCAL)
+		return IB_SMI_HANDLE;
+
+	if (retsmi == IB_SMI_SEND) { /* don't forward */
+		if (opa_smi_handle_dr_smp_send(&recv->mad.opa_smp,
+					       port_priv->device->node_type,
+					       port_num) == IB_SMI_DISCARD)
+			return IB_SMI_DISCARD;
+
+		if (opa_smi_check_local_smp(&recv->mad.opa_smp, port_priv->device) ==
+		    IB_SMI_DISCARD)
+			return IB_SMI_DISCARD;
+
+	} else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) {
+		/* forward case for switches */
+		memcpy(response, recv, sizeof(*response));
+		response->header.recv_wc.wc = &response->header.wc;
+		response->header.recv_wc.recv_buf.opa_mad = &response->mad.opa_mad;
+		response->header.recv_wc.recv_buf.grh = &response->grh;
+
+		agent_send_response((struct ib_mad *)&response->mad.mad,
+				    &response->grh, wc,
+				    port_priv->device,
+				    opa_smi_get_fwd_port(&recv->mad.opa_smp),
+				    qp_info->qp->qp_num,
+				    recv->header.wc.byte_len,
+				    1);
+
+		return IB_SMI_DISCARD;
+	}
+
+	return IB_SMI_HANDLE;
+}
+

Why do you have RDMA_NODE_IB_SWITCH related stuff inside the handle_opa_smi() 
function?
Is there a node type of "switch" in OPA similar to IB?
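
To make the role of the switch test in the quoted function easier to see, its control flow can be reduced to a small decision function. This is only a model: the outcomes of the opa_smi_* checks are passed in as booleans, and struct ex_smp_checks, ex_handle_smi() and the EX_ enums are hypothetical names, not kernel symbols.

```c
#include <stdbool.h>

enum ex_smi_action { EX_SMI_DISCARD, EX_SMI_HANDLE };
enum ex_fwd_action { EX_FWD_LOCAL, EX_FWD_SEND, EX_FWD_FORWARD };

/* Pre-computed outcomes of the individual SMP checks (illustrative). */
struct ex_smp_checks {
	bool recv_ok;            /* receive-side DR SMP check passed */
	enum ex_fwd_action fwd;  /* result of the forward check */
	bool send_ok;            /* send-side DR SMP check passed */
	bool local_ok;           /* local SMP check passed */
	bool is_switch;          /* device is a switch */
};

/*
 * Mirrors the branch structure of the quoted handle_opa_smi(): only a switch
 * forwards an SMP that is neither local nor to be sent back, and a forwarded
 * SMP is not handled locally.
 */
static inline enum ex_smi_action ex_handle_smi(const struct ex_smp_checks *c,
					       bool *forwarded)
{
	*forwarded = false;

	if (!c->recv_ok)
		return EX_SMI_DISCARD;

	if (c->fwd == EX_FWD_LOCAL)
		return EX_SMI_HANDLE;

	if (c->fwd == EX_FWD_SEND) {
		if (!c->send_ok || !c->local_ok)
			return EX_SMI_DISCARD;
	} else if (c->is_switch) {
		*forwarded = true;  /* stands in for agent_send_response() */
		return EX_SMI_DISCARD;
	}

	return EX_SMI_HANDLE;
}
```

In this model the is_switch input is the only place where the node type matters, which is the branch the question is about.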


Thanks,
Suri



Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-05-20 Thread ira.weiny
> 
> Why do you have RDMA_NODE_IB_SWITCH related stuff inside the handle_opa_smi() 
> function?
> Is there a node type of "switch" in OPA similar to IB?
> 

Yes.  OPA uses the same node types as IB.

Ira



Re: [PATCH 14/14] IB/mad: Add final OPA MAD processing

2015-05-20 Thread Jason Gunthorpe
On Wed, May 20, 2015 at 04:13:35AM -0400, ira.we...@intel.com wrote:
> @@ -433,14 +436,23 @@ static inline int get_mad_len(struct mad_rmpp_recv *rmpp_recv)
>  {
> 	struct ib_rmpp_base *rmpp_base;
> 	int hdr_size, data_size, pad;
> +	int opa = rdma_cap_opa_mad(rmpp_recv->agent->qp_info->port_priv->device,
> +				   rmpp_recv->agent->qp_info->port_priv->port_num);

bool opa

Jason