Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-28 Thread Or Gerlitz
On Tue, Apr 28, 2015 at 10:17 AM, Matan Barak  wrote:
> On 4/27/2015 9:22 PM, Or Gerlitz wrote:
> I think the real question is why deal with RCU, which will require
> re-allocation of entries when it's not necessary, or why do we want to use
> rwlock if the kernel provides a mechanism (called seqcount) that fits this
> problem better?
> I disagree about seqcount being complex - if you look at its API you'll find
> it's a lot simpler than RCU.

I took a second look; seqcount is indeed way simpler than RCU, and by
itself is simple to use. If you feel this provides a better solution vs. a
simple rwlock, I'm good with that.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-28 Thread Matan Barak



On 4/27/2015 9:22 PM, Or Gerlitz wrote:

On Mon, Apr 27, 2015 at 10:32 AM, Matan Barak  wrote:

On 4/26/2015 8:20 PM, Or Gerlitz wrote:

On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
 wrote:

From: Matan Barak 



In order to manage multiple types, VLANs and MACs per GID, we
need to store them along with the GID itself. We store the net device
as well, as sometimes GIDs should be handled according to the
net device they came from. Since populating the GID table should
be identical for every RoCE provider, the GID table should be
handled in ib_core.

Add a GID cache table that supports lockless find, add and
delete of GIDs. The lockless nature comes from using a unique
sequence number per table entry and detecting, while reading or
writing, that this sequence hasn't changed.



Matan, please use existing mechanism which fits the problem you are
trying to solve, I guess one of RCU or seqlock should do the job.



seqcount fits this problem better: if a write and a read are done in
parallel, there's a good chance we read an out-of-date entry and are
going to use a GID entry that's going to change at T+epsilon, so RCU doesn't
really have an advantage here.


So going back to the problem... we are talking about applications/drivers
that attempt to establish new connections (reads) and about writes done
on behalf of IP stack changes; both are very much not critical path.
So this is kind of similar to the neighbour table maintained by the ND
subsystem, which is used by all IP-based networking applications and
whose code uses RCU. I don't see what's wrong with RCU for our sort of
smaller-scale subsystem, or what is even wrong with the simple rwlock
which is the mechanism used today by the IB core GID cache; this is
getting too complex, for no reason that I can think of.



I think the real question is why deal with RCU, which will require
re-allocation of entries when it's not necessary, or why do we want to
use rwlock if the kernel provides a mechanism (called seqcount) that
fits this problem better?
I disagree about seqcount being complex - if you look at its API you'll
find it's a lot simpler than RCU.





The current implementation is a bit more efficient than seqcount, as it
allows early termination of read-while-write (because the write puts a known
"currently updating" value that the read knows to ignore). AFAIK, this
doesn't exist in the current seqcount implementation. However, since this
isn't a crucial data-path, I'll change that to seqcount.

seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.



Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-27 Thread Or Gerlitz
On Mon, Apr 27, 2015 at 10:32 AM, Matan Barak  wrote:
> On 4/26/2015 8:20 PM, Or Gerlitz wrote:
>> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
>>  wrote:
>>> From: Matan Barak 

>>> In order to manage multiple types, VLANs and MACs per GID, we
>>> need to store them along with the GID itself. We store the net device
>>> as well, as sometimes GIDs should be handled according to the
>>> net device they came from. Since populating the GID table should
>>> be identical for every RoCE provider, the GID table should be
>>> handled in ib_core.
>>>
>>> Add a GID cache table that supports lockless find, add and
>>> delete of GIDs. The lockless nature comes from using a unique
>>> sequence number per table entry and detecting, while reading or
>>> writing, that this sequence hasn't changed.

>> Matan, please use existing mechanism which fits the problem you are
>> trying to solve, I guess one of RCU or seqlock should do the job.

> seqcount fits this problem better: if a write and a read are done in
> parallel, there's a good chance we read an out-of-date entry and are
> going to use a GID entry that's going to change at T+epsilon, so RCU doesn't
> really have an advantage here.

So going back to the problem... we are talking about applications/drivers
that attempt to establish new connections (reads) and about writes done
on behalf of IP stack changes; both are very much not critical path.
So this is kind of similar to the neighbour table maintained by the ND
subsystem, which is used by all IP-based networking applications and
whose code uses RCU. I don't see what's wrong with RCU for our sort of
smaller-scale subsystem, or what is even wrong with the simple rwlock
which is the mechanism used today by the IB core GID cache; this is
getting too complex, for no reason that I can think of.


> The current implementation is a bit more efficient than seqcount, as it
> allows early termination of read-while-write (because the write puts a known
> "currently updating" value that the read knows to ignore). AFAIK, this
> doesn't exist in the current seqcount implementation. However, since this
> isn't a crucial data-path, I'll change that to seqcount.
>
> seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-27 Thread Matan Barak



On 4/26/2015 8:20 PM, Or Gerlitz wrote:

On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
 wrote:

From: Matan Barak 

In order to manage multiple types, VLANs and MACs per GID, we
need to store them along with the GID itself. We store the net device
as well, as sometimes GIDs should be handled according to the
net device they came from. Since populating the GID table should
be identical for every RoCE provider, the GID table should be
handled in ib_core.

Add a GID cache table that supports lockless find, add and
delete of GIDs. The lockless nature comes from using a unique
sequence number per table entry and detecting, while reading or
writing, that this sequence hasn't changed.


Matan, please use existing mechanism which fits the problem you are
trying to solve, I guess one of RCU or seqlock should do the job.



seqcount fits this problem better: if a write and a read are done in
parallel, there's a good chance we read an out-of-date entry and are
going to use a GID entry that's going to change at T+epsilon, so RCU
doesn't really have an advantage here.


The current implementation is a bit more efficient than seqcount, as it 
allows early termination of read-while-write (because the write puts a 
known "currently updating" value that the read knows to ignore). AFAIK, 
this doesn't exist in the current seqcount implementation. However, 
since this isn't a crucial data-path, I'll change that to seqcount.


seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.


Or.



Matan


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-26 Thread Or Gerlitz
On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur
 wrote:
> From: Matan Barak 
>
> In order to manage multiple types, vlans and MACs per GID, we
> need to store them along the GID itself. We store the net device
> as well, as sometimes GIDs should be handled according to the
> net device they came from. Since populating the GID table should
> be identical for every RoCE provider, the GIDs table should be
> handled in ib_core.
>
> Adding a GID cache table that supports a lockless find, add and
> delete gids. The lockless nature comes from using a unique
> sequence number per table entry and detecting that while reading/
> writing this sequence wasn't changed.

Matan, please use existing mechanism which fits the problem you are
trying to solve, I guess one of RCU or seqlock should do the job.

Or.


RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-16 Thread Liran Liss
> > The RoCE Verbs interface references the HCA GID table in QPs and AHs,
> > for all RoCE versions.
>
> The IBTA specifically does not define software interfaces.  The concern
> here is the architecture and definition of the linux rdma software stack,
> not verbs, despite the fact that the layer is named that.
>

Whatever.
The ib_* API is the Linux implementation for IBTA Verbs (both Infiniband and 
RoCE flavors) and iWARP Verbs.
This interface uses a GID table.
In this patch-set, there is no conceptual change in this regard.
It adds GID types, as stated by the RoCE specification, which is a natural 
extension to the existing code.
I am sure that we can consider additional changes in the future, but one 
patch-set at a time...
 
> > As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot
> > be a port attribute because RoCE HCAs support all of them over the
> > same physical port.
> 
> IMO, the architecture needs to change to support multiple network and
> transport layers over the same link layer.  The current restriction of 1
> "transport" per port is too restrictive, and doesn't seem to be true even for
> current devices, like usnic.
> 

usnic uses UDP transport.
Infiniband and RoCE use the IB transport (all the BTH+ header stuff as defined 
by IBTA).
iWARP uses IETF defined protocols over TCP, namely RDMAP+DDP+MPA.

I am not aware of any device that mixes these transports today.
If we ever see such a need in the future, let's talk again. Currently, this is 
irrelevant.

> > Maybe we should change these patches to encode the port "summary" as a
> > bit mask, and provide convenient masks for queries?
> 
> Switching to bit masks is the longer term objective.



RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-16 Thread Hefty, Sean
> The RoCE Verbs interface references the HCA GID table in QPs and AHs, for
> all RoCE versions.

The IBTA specifically does not define software interfaces.  The concern here is 
the architecture and definition of the linux rdma software stack, not verbs, 
despite the fact that the layer is named that.

> As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot be
> a port attribute because RoCE HCAs support all of them over the same
> physical port.

IMO, the architecture needs to change to support multiple network and transport 
layers over the same link layer.  The current restriction of 1 "transport" per 
port is too restrictive, and doesn't seem to be true even for current devices, 
like usnic.

> Maybe we should change these patches to encode the port "summary" as a bit
> mask, and provide convenient masks for queries?

Switching to bit masks is the longer term objective.



RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-16 Thread Liran Liss
> > > RoCE v2 is really Infiniband over UDP over IP.  Why don't we just
> > > call it IBoUDP like it is?
> > RoCEv2 is the name in the IBTA spec (Annex 17)
> 
> We call RoCE IBoE in the kernel, because that's what it is.  RoCE is an IBTA
> marketing name.
> 
> Looking through the Annex, I don't see where Ethernet is even a requirement
> for this technology to work.  The IB transport is layered over a standard UDP
> header.  I do see where the spec calls out updating the IP header, but that's 
> it.
> 
> Regardless of what it's called, it replaces the underlying network and
> transport protocols, versus IB-classic or IBoE/RoCE.  That should be captured
> properly, not by saying there's a new GID type.  RoCE v2 doesn't even use
> GIDs as part of its protocol.  It uses UDP/IP addresses.
> 
> 

The RoCE Verbs interface references the HCA GID table in QPs and AHs, for all 
RoCE versions.
The specification mandates that the Verbs consumer be able to use the
following protocols over the same RoCE-capable HCA and the same physical port:
- RoCEv1 (L2, IB GRH, IB BTH+, payload)
- RoCEv2 using IPv4 (L2, IPv4, UDP, IB BTH+, payload)
- RoCEv2 using IPv6 (L2, IPv6, UDP, IB BTH+, payload)
The distinction (by spec) is made by associating a GID type attribute with
each GID table entry, which is either IB, IPv4, or IPv6.
This is how apps can create different RC QPs using these different wire 
protocols, or a single UD QP that can send packets to all of these wire 
protocols.

Perhaps we could add another enum entry for RoCEv1 in the patch to make this 
clearer.

> > > IBoUDP changes the Ethertype, replaces the network header, adds a new
> > > transport protocol header, and layers IB over that.  This change
> > > should be exposed properly and not as just a new GID type.
> > I don't understand what you are suggesting here. Can you give an example?
> 
> I don't have a solution here.  Please look at Michael Wang's patch series and
> see how this would fit into that model.  The introduction of iWarp required
> defining a new 'transport' type.  IBoE added a new link layer.  Based on those
> changes, this would warrant introducing a new network layer, so that it can
> be distinguished properly from the other options.  Maybe that's the right
> approach?
> 

The "new-transport" in Michael's patches doesn't refer to the network transport 
layer, but rather acts as a summary of the tuple  
(the network layer is indeed skipped).
The network transport layer of Infiniband and *all* RoCE types is the same: IB 
transport.
As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot be a 
port attribute because RoCE HCAs support all of them over the same physical 
port.

Maybe we should change these patches to encode the port "summary" as a bit 
mask, and provide convenient masks for queries?
Alternatively, we could leave the port qualifiers as they are (i.e., with
distinct <link layer, transport> qualifiers) instead of introducing
"new-transport", but provide convenient wrappers for ULPs to use.
For example:
- is_roce() /* returns true for all RoCE wire protocols */
- is_rocev1()
- is_rocev2()
- is_iwarp()
- ...

> Cisco's NIC reports a transport layer of 'USNIC_UDP', which should really just
> be 'UDP'.  That NIC supports UDP/IP/Ethernet, based on the rdma stack's
> model.
> RoCE v2 is also UDP/IP/Ethernet, it only layers IB over that.  (This makes the
> use of the term 'transport' confusing.  Maybe there should also be a 'session'
> protocol?)  It seems completely reasonable for a device which does
> IB/UDP/IP/Ethernet to someday expose the UDP/IP/Ethernet portion (if it
> doesn't already), and from the same port at the same time.

RoCEv2 uses UDP only as an encapsulation protocol (just like VXLAN).
The transport is well-defined: IBTA transport (just like Infiniband).
In the USNIC case, UDP is indeed the actual network transport.

> 
> Rather than continuing to try to make everything look like an IB-classic 
> device
> because it's convenient, the stack needs to start exposing things properly.  I
> don't know what the right solution should be, but trying to capture this level
> of detail as a different GID type definitely looks like a step in the wrong
> direction.

As mentioned above, this is how Verbs consumers send traffic using the 
different wire-protocols over the same physical port.
This patch-set follows the specification, and cleanly provides all per-GID 
associations in 'struct ib_gid_attr'.

So basically, in RoCE, a GID entry represents a network interface (a netdev) to
the HCA, and encompasses all information related to that interface: MAC,
VLAN, MTU, IP address, namespace, etc.
This also makes it straightforward to add future extensions.

> 
> - Sean


RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-16 Thread Hefty, Sean
> > RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call
> > it IBoUDP like it is?
> RoCEv2 is the name in the IBTA spec (Annex 17)

We call RoCE IBoE in the kernel, because that's what it is.  RoCE is an IBTA 
marketing name.

Looking through the Annex, I don't see where Ethernet is even a requirement for 
this technology to work.  The IB transport is layered over a standard UDP 
header.  I do see where the spec calls out updating the IP header, but that's 
it.

Regardless of what it's called, it replaces the underlying network and 
transport protocols, versus IB-classic or IBoE/RoCE.  That should be captured 
properly, not by saying there's a new GID type.  RoCE v2 doesn't even use GIDs 
as part of its protocol.  It uses UDP/IP addresses.


> > IBoUDP changes the Ethertype, replaces the network header, adds a new
> > transport protocol header, and layers IB over that.  This change should be
> > exposed properly and not as just a new GID type.
> I don't understand what you are suggesting here. Can you give an example?

I don't have a solution here.  Please look at Michael Wang's patch series and 
see how this would fit into that model.  The introduction of iWarp required 
defining a new 'transport' type.  IBoE added a new link layer.  Based on those 
changes, this would warrant introducing a new network layer, so that it can be 
distinguished properly from the other options.  Maybe that's the right approach?

Cisco's NIC reports a transport layer of 'USNIC_UDP', which should really just 
be 'UDP'.  That NIC supports UDP/IP/Ethernet, based on the rdma stack's model.  
RoCE v2 is also UDP/IP/Ethernet, it only layers IB over that.  (This makes the 
use of the term 'transport' confusing.  Maybe there should also be a 'session' 
protocol?)  It seems completely reasonable for a device which does
IB/UDP/IP/Ethernet to someday expose the UDP/IP/Ethernet portion (if it doesn't
already), and from the same port at the same time.

Rather than continuing to try to make everything look like an IB-classic device 
because it's convenient, the stack needs to start exposing things properly.  I 
don't know what the right solution should be, but trying to capture this level 
of detail as a different GID type definitely looks like a step in the wrong 
direction.

- Sean



Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-16 Thread Moni Shoua
On Wed, Apr 15, 2015 at 7:08 PM, Hefty, Sean  wrote:
>> It does not break every app; the choice of which GID type to use is made
>> by the RDMA-CM based on a network topology hint obtained from the IP stack.
>> Please refer to patch 15/33: IB/Core: Changes to the IB Core
>> infrastructure for RoCEv2 support.
>> Of course, if the user does not want to go with this choice made by the
>> RDMA-CM, then there is the option of overriding it using the configfs
>> patch (PATCH 14/33)
>> Hope that clarifies?
>
> RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call it 
> IBoUDP like it is?
RoCEv2 is the name in the IBTA spec (Annex 17)
>
> IBoUDP changes the Ethertype, replaces the network header, adds a new 
> transport protocol header, and layers IB over that.  This change should be 
> exposed properly and not as just a new GID type.
I don't understand what you are suggesting here. Can you give an example?


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-16 Thread Matan Barak
AFAIK, RoCE v2 is the known and official name. Why would we want to come
up with a customized name?


These are indeed two different protocols, thus the comparison to
DHCP-assigned addresses and static addresses is (to say the least) a bit off.


Even when comparing IPv4 and IPv6, the most significant user change is
the sockaddr_in and sockaddr_in6 addresses. IMHO, since the GID format
is identical, the changes could be encoded in a gid_type argument. The
gid_type is an inherent part of the address, making identical GIDs with
different gid_types two different addresses, as expected.


On 4/15/2015 7:21 PM, Suri Shelvapille wrote:

IMHO, it would be good to have a physical layer representation in the naming 
convention.

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Hefty, Sean
Sent: Wednesday, April 15, 2015 12:08 PM
To: Somnath Kotur; Matan Barak; rol...@kernel.org
Cc: linux-rdma@vger.kernel.org
Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache


It does not break every app; the choice of which GID type to use is
made by the RDMA-CM based on a network topology hint obtained from the IP stack.
Please refer to patch 15/33: IB/Core: Changes to the IB Core
infrastructure for RoCEv2 support.
Of course, if the user does not want to go with this choice made by
the RDMA-CM, then there is the option of overriding it using the
configfs patch (PATCH 14/33). Hope that clarifies?


RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call it 
IBoUDP like it is?

IBoUDP changes the Ethertype, replaces the network header, adds a new transport 
protocol header, and layers IB over that.  This change should be exposed 
properly and not as just a new GID type.

This correspondence, and any attachments or files transmitted with this 
correspondence, contains information which may be confidential and privileged 
and is intended solely for the use of the addressee. Unless you are the 
addressee or are authorized to receive messages for the addressee, you may not 
use, copy, disseminate, or disclose this correspondence or any information 
contained in this correspondence to any third party. If you have received this 
correspondence in error, please notify the sender immediately and delete this 
correspondence and any attachments or files transmitted with this 
correspondence from your system, and destroy any and all copies thereof, 
electronic or otherwise. Your cooperation and understanding are greatly 
appreciated.




RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-15 Thread Suri Shelvapille
IMHO, it would be good to have a physical layer representation in the naming 
convention.

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Hefty, Sean
Sent: Wednesday, April 15, 2015 12:08 PM
To: Somnath Kotur; Matan Barak; rol...@kernel.org
Cc: linux-rdma@vger.kernel.org
Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

> It does not break every app; the choice of which GID type to use is
> made by the RDMA-CM based on a network topology hint obtained from the IP stack.
> Please refer to patch 15/33: IB/Core: Changes to the IB Core
> infrastructure for RoCEv2 support.
> Of course, if the user does not want to go with this choice made by
> the RDMA-CM, then there is the option of overriding it using the
> configfs patch (PATCH 14/33). Hope that clarifies?

RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call it 
IBoUDP like it is?

IBoUDP changes the Ethertype, replaces the network header, adds a new transport 
protocol header, and layers IB over that.  This change should be exposed 
properly and not as just a new GID type.



RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-15 Thread Hefty, Sean
> It does not break every app; the choice of which GID type to use is made
> by the RDMA-CM based on a network topology hint obtained from the IP stack.
> Please refer to patch 15/33: IB/Core: Changes to the IB Core
> infrastructure for RoCEv2 support.
> Of course, if the user does not want to go with this choice made by the
> RDMA-CM, then there is the option of overriding it using the configfs
> patch (PATCH 14/33)
> Hope that clarifies?

RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call it 
IBoUDP like it is?

IBoUDP changes the Ethertype, replaces the network header, adds a new transport 
protocol header, and layers IB over that.  This change should be exposed 
properly and not as just a new GID type.


RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-14 Thread Somnath Kotur


> -Original Message-
> From: Hefty, Sean [mailto:sean.he...@intel.com]
> Sent: Tuesday, April 14, 2015 11:02 PM
> To: Matan Barak; Somnath Kotur; rol...@kernel.org
> Cc: linux-rdma@vger.kernel.org
> Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> 
> > This is a part of the GID meta info. The user should be able to choose
> > between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and
> RoCE
> > V2 - just as a user could choose between IPv6 and IPv4.
> 
> IPv4 and IPv6 are different protocols, not different formats for the same
> address.  How does RoCE v2 not break every app? 
It does not break every app; the choice of which GID type to use is made by
the RDMA-CM based on a network topology hint obtained from the IP stack.
Please refer to patch 15/33: IB/Core: Changes to the IB Core infrastructure for 
RoCEv2 support.
Of course, if the user does not want to go with this choice made by the 
RDMA-CM, then there is the option of overriding it using the configfs patch 
(PATCH 14/33)
Hope that clarifies?

Thanks
Som
 


RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-14 Thread Hefty, Sean
> This is a part of the GID meta info. The user should be able to choose
> between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and RoCE
> V2 - just as a user could choose between IPv6 and IPv4.

IPv4 and IPv6 are different protocols, not different formats for the same
address.  How does RoCE v2 not break every app?  This isn't like asking the
user to choose between IPv4 and IPv6; it's asking them to choose between
IPv4 assigned by DHCP and IPv4 assigned statically.


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-14 Thread Bart Van Assche

On 04/14/15 15:23, Matan Barak wrote:

+	mutex_lock(&cache->lock);
+
+	for (ix = 0; ix < cache->sz; ix++)
+		if (cache->data_vec[ix].attr.ndev == ndev)
+			write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
+
+	mutex_unlock(&cache->lock);
+	return 0;


The traditional Linux kernel coding style is one blank line before
mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
before mutex_unlock().



I didn't find this in the CodingStyle doc. Could you please quote or
post a link?


Hello Matan,

I'm not aware of any formal documentation of this style guideline. But 
if you look around in the Linux kernel tree you will see that most 
kernel code follows this style.


Bart.




Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-14 Thread Matan Barak



On 3/26/2015 1:42 AM, Bart Van Assche wrote:

On 03/25/2015 02:19 PM, Somnath Kotur wrote:

+	if (cache->data_vec[ix].attr.ndev &&
+	    cache->data_vec[ix].attr.ndev != old_net_dev)


A few lines earlier the memory old_net_dev points at was freed. If two
instances of this function run concurrently, what prevents that the
old_net_dev memory has been reallocated and hence that attr.ndev ==
old_net_dev although both pointers refer(red) to different network
devices ?



write_gid is *almost* always called under a mutex. The only case where it's 
not protected is free_roce_gid_cache. free_roce_gid_cache is called only 
in the error flow of roce_gid_cache_setup_one, when no concurrent 
write_gid could be made (as the cache isn't set up yet), and in 
roce_gid_cache_cleanup_one. roce_gid_cache_cleanup_one is called in the 
error flow of roce_gid_cache_client_setup_one (where no other write_gid 
calls are expected, for the same reason) and in 
roce_gid_cache_client_cleanup_work_handler, where it's called after 
flush_workqueue(roce_gid_mgmt_wq). Since all write_gid calls go through 
roce_gid_mgmt_wq and we set the cache to inactive mode before flushing 
the wq and freeing the cache, I think we can conclude that no concurrent 
write_gid calls on the same cache are possible.
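
That teardown ordering can be modeled in a few lines of userspace C. All 
names below are invented for illustration (only 
flush_workqueue(roce_gid_mgmt_wq) is from the patch); the point is just 
that queued work checks an "active" flag, so after clearing the flag and 
draining the queue, nothing can race with the free:

```c
#include <assert.h>

/* Toy model of the teardown ordering described above; names are invented. */

struct toy_cache {
	int active;
	int writes;
};

/* Stands in for a write_gid work item queued on roce_gid_mgmt_wq. */
static void toy_write_work(struct toy_cache *c)
{
	if (!c->active)		/* cache being torn down: bail out */
		return;
	c->writes++;
}

/* Stands in for the cleanup path. */
static void toy_cleanup(struct toy_cache *c)
{
	c->active = 0;
	/* flush_workqueue(roce_gid_mgmt_wq) would run here, draining any
	 * toy_write_work() already queued; afterwards freeing is safe. */
}
```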



+ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;


Invoking write_gid() is only safe if the caller serializes write_gid()
calls. Apparently the cache->lock mutex is used for that purpose. So why
is it necessary to use ACCESS_ONCE() here ? Why is it needed to prevent
that the compiler coalesces this write with another write into the same
structure ?



The mutex only serializes cache writes. Cache reads can run concurrently 
with writes and are protected by ACCESS_ONCE.
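
A rough userspace sketch of this reader/writer scheme (illustrative 
names, and __sync_synchronize() standing in for the kernel's 
smp_wmb()/smp_rmb(); this is not the patch's actual code):

```c
#include <pthread.h>
#include <string.h>

/* Userspace model of the per-entry sequence scheme under discussion. */

struct seq_entry {
	unsigned int seq;		/* 0 means "entry being changed" */
	unsigned char gid[16];
};

static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int next_seq = 1;

/* Writer side: serialized by the mutex, like write_gid under cache->lock. */
static void entry_write(struct seq_entry *e, const unsigned char *gid)
{
	pthread_mutex_lock(&write_lock);
	e->seq = 0;			/* tell readers the entry is in flux */
	__sync_synchronize();		/* order seq = 0 before the data write */
	memcpy(e->gid, gid, sizeof(e->gid));
	__sync_synchronize();		/* order the data write before the new seq */
	e->seq = next_seq;
	if (++next_seq == 0)		/* 0 is reserved for "in flux" */
		next_seq = 1;
	pthread_mutex_unlock(&write_lock);
}

/* Reader side: lockless; retry until a stable non-zero seq is observed. */
static void entry_read(const struct seq_entry *e, unsigned char *gid_out)
{
	unsigned int seq;

	do {
		seq = *(volatile const unsigned int *)&e->seq;
		if (seq == 0)
			continue;	/* writer in progress, spin */
		__sync_synchronize();	/* read seq before copying the data */
		memcpy(gid_out, e->gid, sizeof(e->gid));
		__sync_synchronize();	/* copy the data before re-reading seq */
	} while (seq != *(volatile const unsigned int *)&e->seq);
}
```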



+/* Make sure the sequence number we remeber was read


This looks like a typo - shouldn't the above read "remember" ?



Will be fixed in V4, thanks.


BTW, the style of that comment is recommended only for networking code
and not for IB code. Have you verified this patch with checkpatch ?



Of course, and I've just re-run checkpatch on this patch. It doesn't 
catch this.



+mutex_lock(&cache->lock);
+
+for (ix = 0; ix < cache->sz; ix++)
+if (cache->data_vec[ix].attr.ndev == ndev)
+write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
+
+mutex_unlock(&cache->lock);
+return 0;


The traditional Linux kernel coding style is one blank line before
mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
before mutex_unlock().



I didn't find this in the CodingStyle doc. Could you please quote or 
post a link?



+orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
+/* Make sure we read the sequence number before copying the
+ * gid to local storage. */
+smp_rmb();


Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in
.



Ok, I'll change that in V4. I see that READ_ONCE and WRITE_ONCE are 
different from ACCESS_ONCE only for aggregate data types (which isn't 
our case), but it won't hurt to change that.
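
For reference, a simplified userspace rendition of that difference (these 
are not the kernel's actual macros, which live in include/linux/compiler.h 
and are more careful than this sketch): ACCESS_ONCE was a bare volatile 
cast, which compilers only honor reliably for scalar types, while 
READ_ONCE falls back to a byte-wise copy for aggregates.

```c
#include <string.h>

/* Simplified re-implementations, for illustration only. */

#define MY_ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static inline void my_read_once_size(const volatile void *p, void *res,
				     int size)
{
	switch (size) {
	case 1: *(char  *)res = *(volatile char  *)p; break;
	case 2: *(short *)res = *(volatile short *)p; break;
	case 4: *(int   *)res = *(volatile int   *)p; break;
	case 8: *(long long *)res = *(volatile long long *)p; break;
	default:			/* aggregate: copy byte by byte */
		memcpy(res, (const void *)p, size);
		break;
	}
}

#define MY_READ_ONCE(x) \
	({ __typeof__(x) __val; my_read_once_size(&(x), &__val, sizeof(__val)); __val; })
```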



+static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
+{
+int i;
+struct ib_roce_gid_cache *cache =
+ib_dev->cache.roce_gid_cache[port - 1];
+
+if (!cache)
+return;
+
+for (i = 0; i < cache->sz; ++i) {
+if (memcmp(&cache->data_vec[i].gid, &zgid,
+   sizeof(cache->data_vec[i].gid)))
+write_gid(ib_dev, port, cache, i, &zgid, &zattr);
+}

 > +kfree(cache->data_vec);
 > +kfree(cache);
 > +}

Overwriting data just before it is freed is not useful. Please use
CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such code.



It's mandatory as write_gid with zgid might cause the vendor driver to 
free memory it allocated for this GID entry (like in the mlx4 case).
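
A toy illustration of that point (the callback below is invented; the real 
modify_gid signature is defined by the patch series): writing the zero GID 
tells the provider to release per-entry resources it allocated, so it is a 
semantic operation rather than a scrub of soon-to-be-freed memory.

```c
#include <stdlib.h>
#include <string.h>

struct gid { unsigned char raw[16]; };
static const struct gid zgid;		/* all-zero GID */

struct vendor_ctx { int hw_index; };	/* invented per-entry vendor state */

/* Illustrative provider callback: a zero GID means "entry cleared",
 * so the vendor frees whatever it allocated for this slot. */
static int demo_modify_gid(const struct gid *g, void **context)
{
	if (memcmp(g, &zgid, sizeof(*g)) == 0) {
		free(*context);		/* release per-entry state */
		*context = NULL;
		return 0;
	}
	if (!*context)
		*context = malloc(sizeof(struct vendor_ctx));
	return *context ? 0 : -1;
}
```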



Bart.


Thanks for the review :)

Matan


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-14 Thread Matan Barak



On 4/14/2015 2:50 AM, Hefty, Sean wrote:

Yes, this comment probably could use a reword..

+   IB_GID_TYPE_IB= 0,
+   IB_GID_TYPE_ROCE_V2   = 1,
+   IB_GID_TYPE_SIZE
+};


Can you explain the purpose of defining a 'GID type'.  A GID is just a

global

address.  Why does it matter to anyone using it how it was constructed?


This is part of RoCE V2 Specification.  Please refer to Section A 17.8 .
The GID Type determines the protocol for outbound packet generation i.e
RoCE V1 (0x8915 Ether Type) or RoCEV2 (IPv4 or IPv6)


This isn't an interface for the RoCE specification.  Why does this need to be 
added to the verbs interface?  It hasn't been needed by apps yet, and I don't 
see why the apps should be made to care now how the GID is formatted.



This is a part of the GID meta info. The user should be able to choose 
between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and RoCE 
V2 - just as a user could choose between IPv6 and IPv4.



@@ -265,7 +295,9 @@ enum ib_port_cap_flags {
IB_PORT_BOOT_MGMT_SUP   = 1 << 23,
IB_PORT_LINK_LATENCY_SUP= 1 << 24,
IB_PORT_CLIENT_REG_SUP  = 1 << 25,
-   IB_PORT_IP_BASED_GIDS   = 1 << 26
+   IB_PORT_IP_BASED_GIDS   = 1 << 26,
+   IB_PORT_ROCE= 1 << 27,
+   IB_PORT_ROCE_V2 = 1 << 28,


Why does RoCE suddenly require a port capability bit?  RoCE runs today
without setting any bit.

Again, this is part of RoCE V2 SPEC, please refer to Section A17.5.1-
Query HCA(Pasting snippet below)
A new "RoCE Supported" capability bit shall be added to the Port
Attributes
list. This capability bit applies exclusively to ports of the new
"RoCEv2" type


Same comment as above.




RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-13 Thread Hefty, Sean
> Yes, this comment probably could use a reword..
> > > + IB_GID_TYPE_IB= 0,
> > > + IB_GID_TYPE_ROCE_V2   = 1,
> > > + IB_GID_TYPE_SIZE
> > > +};
> >
> > Can you explain the purpose of defining a 'GID type'.  A GID is just a
> global
> > address.  Why does it matter to anyone using it how it was constructed?
> 
> This is part of RoCE V2 Specification.  Please refer to Section A 17.8 .
> The GID Type determines the protocol for outbound packet generation i.e
> RoCE V1 (0x8915 Ether Type) or RoCEV2 (IPv4 or IPv6)

This isn't an interface for the RoCE specification.  Why does this need to be 
added to the verbs interface?  It hasn't been needed by apps yet, and I don't 
see why the apps should be made to care now how the GID is formatted.

> > > @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
> > >   IB_PORT_BOOT_MGMT_SUP   = 1 << 23,
> > >   IB_PORT_LINK_LATENCY_SUP= 1 << 24,
> > >   IB_PORT_CLIENT_REG_SUP  = 1 << 25,
> > > - IB_PORT_IP_BASED_GIDS   = 1 << 26
> > > + IB_PORT_IP_BASED_GIDS   = 1 << 26,
> > > + IB_PORT_ROCE= 1 << 27,
> > > + IB_PORT_ROCE_V2 = 1 << 28,
> >
> > Why does RoCE suddenly require a port capability bit?  RoCE runs today
> > without setting any bit.
> Again, this is part of RoCE V2 SPEC, please refer to Section A17.5.1-
> Query HCA(Pasting snippet below)
> A new "RoCE Supported" capability bit shall be added to the Port
> Attributes
> list. This capability bit applies exclusively to ports of the new
> "RoCEv2" type

Same comment as above.


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-08 Thread Moni Shoua
On Wed, Apr 8, 2015 at 2:30 AM, Hefty, Sean  wrote:
>> In order to manage multiple types, vlans and MACs per GID, we
>> need to store them along the GID itself. We store the net device
>> as well, as sometimes GIDs should be handled according to the
>> net device they came from. Since populating the GID table should
>> be identical for every RoCE provider, the GIDs table should be
>> handled in ib_core.
>>
>> Adding a GID cache table that supports a lockless find, add and
>> delete gids. The lockless nature comes from using a unique
>> sequence number per table entry and detecting that while reading/
>> writing this sequence wasn't changed.
>>
>> By using this RoCE GID cache table, providers must implement a
>> modify_gid callback. The table is managed exclusively by
>> this roce_gid_cache and the provider just need to write
>> the data to the hardware.
>>
>> Signed-off-by: Matan Barak 
>> Signed-off-by: Somnath Kotur 
>> ---
>>  drivers/infiniband/core/Makefile |   3 +-
>>  drivers/infiniband/core/core_priv.h  |  24 ++
>>  drivers/infiniband/core/roce_gid_cache.c | 518
>
> Why does RoCE need such a complex gid cache?  If a gid cache is needed at 
> all, why should it be restricted to RoCE only?  And why is such a complex 
> synchronization scheme needed?  Seriously, how many times will GIDs change 
> and how many readers at once do you expect to have?
>
The GID cache is also implemented for link layer IB. However, for RoCE the
GID cache is also the manager of the table. This means that adding or
removing entries from the GID table is the responsibility of the
cache and not of the HW/device driver. This is a new scheme that frees
each vendor's driver from having to deal with net and inet events.
The content of the GID table is much more dynamic for RoCE than for IB,
and so is access to the table, so I guess an extra mechanism is required.
The fact that a GID entry is associated with net_device and inet_addr
objects that can be modified/deleted at any time is an example.
>
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 65994a1..1866595 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -64,6 +64,36 @@ union ib_gid {
>>   } global;
>>  };
>>
>> +extern union ib_gid zgid;
>> +
>> +enum ib_gid_type {
>> + /* If link layer is Ethernet, this is RoCE V1 */
>
> I don't understand this comment.  Does RoCE v2 not run on Ethernet?
>
>> + IB_GID_TYPE_IB= 0,
>> + IB_GID_TYPE_ROCE_V2   = 1,
>> + IB_GID_TYPE_SIZE
>> +};
>
> Can you explain the purpose of defining a 'GID type'.  A GID is just a global 
> address.  Why does it matter to anyone using it how it was constructed?
>
>> +
>> +struct ib_gid_attr {
>> + enum ib_gid_typegid_type;
>> + struct net_device   *ndev;
>> +};
>> +
>> +struct ib_roce_gid_cache_entry {
>> + /* seq number of 0 indicates entry being changed. */
>> + unsigned intseq;
>> + union ib_gidgid;
>> + struct ib_gid_attr  attr;
>> + void   *context;
>> +};
>> +
>> +struct ib_roce_gid_cache {
>> + int  active;
>> + int  sz;
>> + /* locking against multiple writes in data_vec */
>> + struct mutex lock;
>> + struct ib_roce_gid_cache_entry *data_vec;
>> +};
>> +
>>  enum rdma_node_type {
>>   /* IB values map to NodeInfo:NodeType. */
>>   RDMA_NODE_IB_CA = 1,
>> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>>   IB_PORT_BOOT_MGMT_SUP   = 1 << 23,
>>   IB_PORT_LINK_LATENCY_SUP= 1 << 24,
>>   IB_PORT_CLIENT_REG_SUP  = 1 << 25,
>> - IB_PORT_IP_BASED_GIDS   = 1 << 26
>> + IB_PORT_IP_BASED_GIDS   = 1 << 26,
>> + IB_PORT_ROCE= 1 << 27,
>> + IB_PORT_ROCE_V2 = 1 << 28,
>
> Why does RoCE suddenly require a port capability bit?  RoCE runs today 
> without setting any bit.
>


RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-07 Thread Somnath Kotur
Hi Sean,

> -Original Message-
> From: Hefty, Sean [mailto:sean.he...@intel.com]
> Sent: Wednesday, April 08, 2015 6:00 AM
> To: Somnath Kotur; rol...@kernel.org
> Cc: linux-rdma@vger.kernel.org; Matan Barak
> Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> 
> > In order to manage multiple types, vlans and MACs per GID, we need to
> > store them along the GID itself. We store the net device as well, as
> > sometimes GIDs should be handled according to the net device they came
> > from. Since populating the GID table should be identical for every
> > RoCE provider, the GIDs table should be handled in ib_core.
> >
> > Adding a GID cache table that supports a lockless find, add and delete
> > gids. The lockless nature comes from using a unique sequence number
> > per table entry and detecting that while reading/ writing this
> > sequence wasn't changed.
> >
> > By using this RoCE GID cache table, providers must implement a
> > modify_gid callback. The table is managed exclusively by this
> > roce_gid_cache and the provider just need to write the data to the
> > hardware.
> >
> > Signed-off-by: Matan Barak 
> > Signed-off-by: Somnath Kotur 
> > ---
> >  drivers/infiniband/core/Makefile |   3 +-
> >  drivers/infiniband/core/core_priv.h  |  24 ++
> >  drivers/infiniband/core/roce_gid_cache.c | 518
> 
> Why does RoCE need such a complex gid cache?  If a gid cache is needed at
> all, why should it be restricted to RoCE only?  And why is such a complex
> synchronization scheme needed?  Seriously, how many times will GIDs
> change and how many readers at once do you expect to have?
> 
> 
> > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index
> > 65994a1..1866595 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -64,6 +64,36 @@ union ib_gid {
> > } global;
> >  };
> >
> > +extern union ib_gid zgid;
> > +
> > +enum ib_gid_type {
> > +   /* If link layer is Ethernet, this is RoCE V1 */
> 
> I don't understand this comment.  Does RoCE v2 not run on Ethernet?
> 
Yes, this comment probably could use a reword..
> > +   IB_GID_TYPE_IB= 0,
> > +   IB_GID_TYPE_ROCE_V2   = 1,
> > +   IB_GID_TYPE_SIZE
> > +};
> 
> Can you explain the purpose of defining a 'GID type'.  A GID is just a global
> address.  Why does it matter to anyone using it how it was constructed?

This is part of RoCE V2 Specification.  Please refer to Section A 17.8 . 
The GID Type determines the protocol for outbound packet generation i.e RoCE V1 
(0x8915 Ether Type) or RoCEV2 (IPv4 or IPv6)
> 
> > +
> > +struct ib_gid_attr {
> > +   enum ib_gid_typegid_type;
> > +   struct net_device   *ndev;
> > +};
> > +
> > +struct ib_roce_gid_cache_entry {
> > +   /* seq number of 0 indicates entry being changed. */
> > +   unsigned intseq;
> > +   union ib_gidgid;
> > +   struct ib_gid_attr  attr;
> > +   void   *context;
> > +};
> > +
> > +struct ib_roce_gid_cache {
> > +   int  active;
> > +   int  sz;
> > +   /* locking against multiple writes in data_vec */
> > +   struct mutex lock;
> > +   struct ib_roce_gid_cache_entry *data_vec; };
> > +
> >  enum rdma_node_type {
> > /* IB values map to NodeInfo:NodeType. */
> > RDMA_NODE_IB_CA = 1,
> > @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
> > IB_PORT_BOOT_MGMT_SUP   = 1 << 23,
> > IB_PORT_LINK_LATENCY_SUP= 1 << 24,
> > IB_PORT_CLIENT_REG_SUP  = 1 << 25,
> > -   IB_PORT_IP_BASED_GIDS   = 1 << 26
> > +   IB_PORT_IP_BASED_GIDS   = 1 << 26,
> > +   IB_PORT_ROCE= 1 << 27,
> > +   IB_PORT_ROCE_V2 = 1 << 28,
> 
> Why does RoCE suddenly require a port capability bit?  RoCE runs today
> without setting any bit.
Again, this is part of RoCE V2 SPEC, please refer to Section A17.5.1- Query 
HCA(Pasting snippet below)
A new "RoCE Supported" capability bit shall be added to the Port Attributes
list. This capability bit applies exclusively to ports of the new
"RoCEv2" type


Thanks
Som


RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-07 Thread Hefty, Sean
> In order to manage multiple types, vlans and MACs per GID, we
> need to store them along the GID itself. We store the net device
> as well, as sometimes GIDs should be handled according to the
> net device they came from. Since populating the GID table should
> be identical for every RoCE provider, the GIDs table should be
> handled in ib_core.
> 
> Adding a GID cache table that supports a lockless find, add and
> delete gids. The lockless nature comes from using a unique
> sequence number per table entry and detecting that while reading/
> writing this sequence wasn't changed.
> 
> By using this RoCE GID cache table, providers must implement a
> modify_gid callback. The table is managed exclusively by
> this roce_gid_cache and the provider just need to write
> the data to the hardware.
> 
> Signed-off-by: Matan Barak 
> Signed-off-by: Somnath Kotur 
> ---
>  drivers/infiniband/core/Makefile |   3 +-
>  drivers/infiniband/core/core_priv.h  |  24 ++
>  drivers/infiniband/core/roce_gid_cache.c | 518

Why does RoCE need such a complex gid cache?  If a gid cache is needed at all, 
why should it be restricted to RoCE only?  And why is such a complex 
synchronization scheme needed?  Seriously, how many times will GIDs change and 
how many readers at once do you expect to have?


> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 65994a1..1866595 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -64,6 +64,36 @@ union ib_gid {
>   } global;
>  };
> 
> +extern union ib_gid zgid;
> +
> +enum ib_gid_type {
> + /* If link layer is Ethernet, this is RoCE V1 */

I don't understand this comment.  Does RoCE v2 not run on Ethernet?

> + IB_GID_TYPE_IB= 0,
> + IB_GID_TYPE_ROCE_V2   = 1,
> + IB_GID_TYPE_SIZE
> +};

Can you explain the purpose of defining a 'GID type'.  A GID is just a global 
address.  Why does it matter to anyone using it how it was constructed?

> +
> +struct ib_gid_attr {
> + enum ib_gid_typegid_type;
> + struct net_device   *ndev;
> +};
> +
> +struct ib_roce_gid_cache_entry {
> + /* seq number of 0 indicates entry being changed. */
> + unsigned intseq;
> + union ib_gidgid;
> + struct ib_gid_attr  attr;
> + void   *context;
> +};
> +
> +struct ib_roce_gid_cache {
> + int  active;
> + int  sz;
> + /* locking against multiple writes in data_vec */
> + struct mutex lock;
> + struct ib_roce_gid_cache_entry *data_vec;
> +};
> +
>  enum rdma_node_type {
>   /* IB values map to NodeInfo:NodeType. */
>   RDMA_NODE_IB_CA = 1,
> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>   IB_PORT_BOOT_MGMT_SUP   = 1 << 23,
>   IB_PORT_LINK_LATENCY_SUP= 1 << 24,
>   IB_PORT_CLIENT_REG_SUP  = 1 << 25,
> - IB_PORT_IP_BASED_GIDS   = 1 << 26
> + IB_PORT_IP_BASED_GIDS   = 1 << 26,
> + IB_PORT_ROCE= 1 << 27,
> + IB_PORT_ROCE_V2 = 1 << 28,

Why does RoCE suddenly require a port capability bit?  RoCE runs today without 
setting any bit.



RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-03-26 Thread Somnath Kotur
Hi Matan/Moni,
Could either of you please respond to both of Bart's 
queries?

Thanks
Somnath

> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Thursday, March 26, 2015 5:13 AM
> To: Somnath Kotur; rol...@kernel.org
> Cc: linux-rdma@vger.kernel.org; Matan Barak
> Subject: Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> 
> On 03/25/2015 02:19 PM, Somnath Kotur wrote:
> > +   if (cache->data_vec[ix].attr.ndev &&
> > +   cache->data_vec[ix].attr.ndev != old_net_dev)
> 
> A few lines earlier the memory old_net_dev points at was freed. If two
> instances of this function run concurrently, what prevents that the
> old_net_dev memory has been reallocated and hence that attr.ndev ==
> old_net_dev although both pointers refer(red) to different network devices
> ?
> 
> > +   ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;
> 
> Invoking write_gid() is only safe if the caller serializes write_gid() calls.
> Apparently the cache->lock mutex is used for that purpose. So why is it
> necessary to use ACCESS_ONCE() here ? Why is it needed to prevent that
> the compiler coalesces this write with another write into the same structure
> ?
> 
> > +   /* Make sure the sequence number we remeber was read
> 
> This looks like a typo - shouldn't the above read "remember" ?
> 
> BTW, the style of that comment is recommended only for networking code
> and not for IB code. Have you verified this patch with checkpatch ?
> 
> > +   mutex_lock(&cache->lock);
> > +
> > +   for (ix = 0; ix < cache->sz; ix++)
> > +   if (cache->data_vec[ix].attr.ndev == ndev)
> > +   write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
> > +
> > +   mutex_unlock(&cache->lock);
> > +   return 0;
> 
> The traditional Linux kernel coding style is one blank line before
> mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
> before mutex_unlock().
> 
> > +   orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
> > +   /* Make sure we read the sequence number before copying the
> > +* gid to local storage. */
> > +   smp_rmb();
> 
> Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in
> .
> 
> > +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port) {
> > +   int i;
> > +   struct ib_roce_gid_cache *cache =
> > +   ib_dev->cache.roce_gid_cache[port - 1];
> > +
> > +   if (!cache)
> > +   return;
> > +
> > +   for (i = 0; i < cache->sz; ++i) {
> > +   if (memcmp(&cache->data_vec[i].gid, &zgid,
> > +  sizeof(cache->data_vec[i].gid)))
> > +   write_gid(ib_dev, port, cache, i, &zgid, &zattr);
> > +   }
>  > +  kfree(cache->data_vec);
>  > +  kfree(cache);
>  > +}
> 
> Overwriting data just before it is freed is not useful. Please use
> CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such
> code.
> 
> Bart.


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-03-25 Thread Bart Van Assche

On 03/25/2015 02:19 PM, Somnath Kotur wrote:

+   if (cache->data_vec[ix].attr.ndev &&
+   cache->data_vec[ix].attr.ndev != old_net_dev)


A few lines earlier the memory old_net_dev points at was freed. If two 
instances of this function run concurrently, what prevents that the 
old_net_dev memory has been reallocated and hence that attr.ndev == 
old_net_dev although both pointers refer(red) to different network devices ?



+   ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;


Invoking write_gid() is only safe if the caller serializes write_gid() 
calls. Apparently the cache->lock mutex is used for that purpose. So why 
is it necessary to use ACCESS_ONCE() here ? Why is it needed to prevent 
that the compiler coalesces this write with another write into the same 
structure ?



+   /* Make sure the sequence number we remeber was read


This looks like a typo - shouldn't the above read "remember" ?

BTW, the style of that comment is recommended only for networking code 
and not for IB code. Have you verified this patch with checkpatch ?



+   mutex_lock(&cache->lock);
+
+   for (ix = 0; ix < cache->sz; ix++)
+   if (cache->data_vec[ix].attr.ndev == ndev)
+   write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
+
+   mutex_unlock(&cache->lock);
+   return 0;


The traditional Linux kernel coding style is one blank line before 
mutex_lock() and after mutex_unlock() but not after mutex_lock() nor 
before mutex_unlock().



+   orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
+   /* Make sure we read the sequence number before copying the
+* gid to local storage. */
+   smp_rmb();


Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in 
.



+static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
+{
+   int i;
+   struct ib_roce_gid_cache *cache =
+   ib_dev->cache.roce_gid_cache[port - 1];
+
+   if (!cache)
+   return;
+
+   for (i = 0; i < cache->sz; ++i) {
+   if (memcmp(&cache->data_vec[i].gid, &zgid,
+  sizeof(cache->data_vec[i].gid)))
+   write_gid(ib_dev, port, cache, i, &zgid, &zattr);
+   }

> +  kfree(cache->data_vec);
> +  kfree(cache);
> +}

Overwriting data just before it is freed is not useful. Please use 
CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such code.


Bart.