Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On Tue, Apr 28, 2015 at 10:17 AM, Matan Barak wrote: > On 4/27/2015 9:22 PM, Or Gerlitz wrote: > I think the real question is why deal with RCU, which will require > re-allocation of entries when it's not necessary, or why use an > rwlock if the kernel provides a mechanism (called seqcount) that fits this > problem better? > I disagree about seqcount being complex - if you look at its API you'll find > it's a lot simpler than RCU. I took a second look; seqcount is indeed way simpler than RCU, and by itself is simple to use. If you feel this provides a better solution vs. a simple rwlock, I'm good with that. Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On 4/27/2015 9:22 PM, Or Gerlitz wrote: On Mon, Apr 27, 2015 at 10:32 AM, Matan Barak wrote: On 4/26/2015 8:20 PM, Or Gerlitz wrote: On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur wrote: From: Matan Barak In order to manage multiple types, VLANs and MACs per GID, we need to store them along with the GID itself. We store the net device as well, as sometimes GIDs should be handled according to the net device they came from. Since populating the GID table should be identical for every RoCE provider, the GID table should be handled in ib_core. Adding a GID cache table that supports lockless find, add and delete of GIDs. The lockless nature comes from using a unique sequence number per table entry and detecting that this sequence wasn't changed while reading/writing. Matan, please use an existing mechanism that fits the problem you are trying to solve; I guess one of RCU or seqlock should do the job. seqcount fits this problem better. If a write and a read are done in parallel, there's a good chance we read an out-of-date entry and are going to use a GID entry that's going to change in T+epsilon, so RCU doesn't really have an advantage here. So going back to the problem... we are talking about applications/drivers that attempt to establish new connections doing reads, and writes done on behalf of IP stack changes; both are very much not critical path. So this is kind of similar to the neighbour table maintained by the ND subsystem, which is used by all IP-based networking applications, and that code uses RCU. I don't see what's wrong with RCU for our sort of smaller-scale subsystem, or what is even wrong with a simple rwlock, which is the mechanism used today by the IB core GID cache; this is getting too complex for no reason that I can think of. I think the real question is why deal with RCU, which will require re-allocation of entries when it's not necessary, or why use an rwlock if the kernel provides a mechanism (called seqcount) that fits this problem better?
I disagree about seqcount being complex - if you look at its API you'll find it's a lot simpler than RCU. The current implementation is a bit more efficient than seqcount, as it allows early termination of read-while-write (because the write puts a known "currently updating" value that the read knows to ignore). AFAIK, this doesn't exist in the current seqcount implementation. However, since this isn't a crucial data-path, I'll change that to seqcount. seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.
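For readers following along, the retry loop being debated can be modeled in a few lines of userspace C. This is only an illustrative model with made-up names; the real kernel API is seqcount_t with read_seqcount_begin()/read_seqcount_retry() on the read side and write_seqcount_begin()/write_seqcount_end() on the write side, and it also issues the memory barriers this single-threaded sketch omits:

```c
#include <assert.h>

/* Simplified model of the seqcount pattern: the writer bumps the
 * sequence number around an update (odd while in progress), and the
 * reader retries if it observed an odd or changed sequence. */
struct gid_entry {
    unsigned int seq;   /* odd while an update is in progress */
    int gid;            /* stand-in for the 128-bit GID */
};

static void entry_write(struct gid_entry *e, int new_gid)
{
    e->seq++;           /* now odd: readers will retry */
    e->gid = new_gid;
    e->seq++;           /* even again: update complete */
}

static int entry_read(const struct gid_entry *e)
{
    unsigned int start;
    int val;

    do {
        start = e->seq;
        val = e->gid;
    } while ((start & 1) || start != e->seq);  /* retry on concurrent write */

    return val;
}
```

A reader that observes an odd sequence number, or a number that changed underneath it, simply retries; writers never block readers, which is the property Matan wants for the GID cache.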
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On Mon, Apr 27, 2015 at 10:32 AM, Matan Barak wrote: > On 4/26/2015 8:20 PM, Or Gerlitz wrote: >> On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur >> wrote: >>> From: Matan Barak >>> In order to manage multiple types, VLANs and MACs per GID, we >>> need to store them along with the GID itself. We store the net device >>> as well, as sometimes GIDs should be handled according to the >>> net device they came from. Since populating the GID table should >>> be identical for every RoCE provider, the GID table should be >>> handled in ib_core. >>> >>> Adding a GID cache table that supports lockless find, add and >>> delete of GIDs. The lockless nature comes from using a unique >>> sequence number per table entry and detecting that this sequence >>> wasn't changed while reading/writing. >> Matan, please use an existing mechanism that fits the problem you are >> trying to solve; I guess one of RCU or seqlock should do the job. > seqcount fits this problem better. If a write and a read are done in > parallel, there's a good chance we read an out-of-date entry and we are > going to use a GID entry that's going to change in T+epsilon, so RCU doesn't > really have an advantage here. So going back to the problem... we are talking about applications/drivers that attempt to establish new connections doing reads, and writes done on behalf of IP stack changes; both are very much not critical path. So this is kind of similar to the neighbour table maintained by the ND subsystem, which is used by all IP-based networking applications, and that code uses RCU. I don't see what's wrong with RCU for our sort of smaller-scale subsystem, or what is even wrong with a simple rwlock, which is the mechanism used today by the IB core GID cache; this is getting too complex for no reason that I can think of. > The current implementation is a bit more efficient than seqcount, as it > allows early termination of read-while-write (because the write puts a known > "currently updating" value that the read knows to ignore).
> AFAIK, this doesn't exist in the current seqcount implementation. However, since this > isn't a crucial data-path, I'll change that to seqcount. > > seqcount is preferred over seqlock, as I don't need the spinlock in seqlock.
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On 4/26/2015 8:20 PM, Or Gerlitz wrote: On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur wrote: From: Matan Barak In order to manage multiple types, VLANs and MACs per GID, we need to store them along with the GID itself. We store the net device as well, as sometimes GIDs should be handled according to the net device they came from. Since populating the GID table should be identical for every RoCE provider, the GID table should be handled in ib_core. Adding a GID cache table that supports lockless find, add and delete of GIDs. The lockless nature comes from using a unique sequence number per table entry and detecting that this sequence wasn't changed while reading/writing. Matan, please use an existing mechanism that fits the problem you are trying to solve; I guess one of RCU or seqlock should do the job. seqcount fits this problem better. If a write and a read are done in parallel, there's a good chance we read an out-of-date entry and are going to use a GID entry that's going to change in T+epsilon, so RCU doesn't really have an advantage here. The current implementation is a bit more efficient than seqcount, as it allows early termination of read-while-write (because the write puts a known "currently updating" value that the read knows to ignore). AFAIK, this doesn't exist in the current seqcount implementation. However, since this isn't a crucial data-path, I'll change that to seqcount. seqcount is preferred over seqlock, as I don't need the spinlock in seqlock. Or. Matan
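The "early termination" behavior Matan describes (the writer publishing a known "currently updating" value so readers can bail out instead of spinning) can be sketched roughly as follows. The names and the sentinel value are hypothetical, not the actual patch code, and the kernel's memory barriers and single-writer serialization are assumed rather than shown:

```c
#include <assert.h>
#include <string.h>

#define SEQ_UPDATING (~0u)  /* hypothetical "currently updating" marker */

struct gid_slot {
    unsigned int seq;
    char gid[16];
};

/* Returns 0 and copies the GID, or -1 if the slot is mid-update or
 * changed during the copy; the caller decides whether to retry. */
static int gid_slot_read(const struct gid_slot *s, char out[16])
{
    unsigned int start = s->seq;

    if (start == SEQ_UPDATING)
        return -1;              /* early termination: don't wait */

    memcpy(out, s->gid, sizeof(s->gid));
    if (s->seq != start)
        return -1;              /* raced with a writer */
    return 0;
}

/* Assumes writers are serialized (the patch uses a mutex for this). */
static void gid_slot_write(struct gid_slot *s, const char gid[16])
{
    unsigned int next = s->seq + 1;

    s->seq = SEQ_UPDATING;      /* readers now fail fast */
    memcpy(s->gid, gid, sizeof(s->gid));
    s->seq = next;              /* publish with a fresh sequence */
}
```

This is the difference from stock seqcount that Matan concedes: a plain seqcount reader would spin until the write completes, while this reader returns immediately.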
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On Thu, Mar 26, 2015 at 12:19 AM, Somnath Kotur wrote: > From: Matan Barak > > In order to manage multiple types, VLANs and MACs per GID, we > need to store them along with the GID itself. We store the net device > as well, as sometimes GIDs should be handled according to the > net device they came from. Since populating the GID table should > be identical for every RoCE provider, the GID table should be > handled in ib_core. > > Adding a GID cache table that supports lockless find, add and > delete of GIDs. The lockless nature comes from using a unique > sequence number per table entry and detecting that this sequence > wasn't changed while reading/writing. Matan, please use an existing mechanism that fits the problem you are trying to solve; I guess one of RCU or seqlock should do the job. Or.
RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> > > The RoCE Verbs interface references the HCA GID table in QPs and AHs, > > for all RoCE versions. > > The IBTA specifically does not define software interfaces. The concern here > is the architecture and definition of the linux rdma software stack, not verbs, > despite the fact that the layer is named that. > Whatever. The ib_* API is the Linux implementation for IBTA Verbs (both Infiniband and RoCE flavors) and iWARP Verbs. This interface uses a GID table. In this patch-set, there is no conceptual change in this regard. It adds GID types, as stated by the RoCE specification, which is a natural extension to the existing code. I am sure that we can consider additional changes in the future, but one patch-set at a time... > > As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot > > be a port attribute because RoCE HCAs support all of them over the > > same physical port. > > IMO, the architecture needs to change to support multiple network and > transport layers over the same link layer. The current restriction of 1 > "transport" per port is too restrictive, and doesn't seem to be true even for > current devices, like usnic. > usnic uses UDP transport. Infiniband and RoCE use the IB transport (all the BTH+ header stuff as defined by IBTA). iWARP uses IETF defined protocols over TCP, namely RDMAP+DDP+MPA. I am not aware of any device that mixes these transports today. If we ever see such a need in the future, let's talk again. Currently, this is irrelevant. > > Maybe we should change these patches to encode the port "summary" as a > > bit mask, and provide convenient masks for queries? > > Switching to bit masks is the longer term objective.
RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> The RoCE Verbs interface references the HCA GID table in QPs and AHs, for > all RoCE versions. The IBTA specifically does not define software interfaces. The concern here is the architecture and definition of the linux rdma software stack, not verbs, despite the fact that the layer is named that. > As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot be > a port attribute because RoCE HCAs support all of them over the same > physical port. IMO, the architecture needs to change to support multiple network and transport layers over the same link layer. The current restriction of 1 "transport" per port is too restrictive, and doesn't seem to be true even for current devices, like usnic. > Maybe we should change these patches to encode the port "summary" as a bit > mask, and provide convenient masks for queries? Switching to bit masks is the longer term objective.
RE: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> > > RoCE v2 is really Infiniband over UDP over IP. Why don't we just > > > call > > it IBoUDP like it is? > > RoCEv2 is the name in the IBTA spec (Annex 17) > > We call RoCE IBoE in the kernel, because that's what it is. RoCE is an IBTA > marketing name. > > Looking through the Annex, I don't see where Ethernet is even a requirement > for this technology to work. The IB transport is layered over a standard UDP > header. I do see where the spec calls out updating the IP header, but that's > it. > > Regardless of what it's called, it replaces the underlying network and > transport protocols, versus IB-classic or IBoE/RoCE. That should be captured > properly, not by saying there's a new GID type. RoCE v2 doesn't even use > GIDs as part of its protocol. It uses UDP/IP addresses. > > The RoCE Verbs interface references the HCA GID table in QPs and AHs, for all RoCE versions. The specification mandates that the Verbs consumer may use the following protocols over the same RoCE-capable HCA and the same physical port: - RoCEv1 (L2, IB GRH, IB BTH+, payload) - RoCEv2 using IPv4 (L2, IPv4, UDP, IB BTH+, payload) - RoCEv2 using IPv6 (L2, IPv6, UDP, IB BTH+, payload) The distinction (by spec) is done by associating a GID type attribute to each GID table entry, which is either IB, IPv4, or IPv6. This is how apps can create different RC QPs using these different wire protocols, or a single UD QP that can send packets to all of these wire protocols. Perhaps we could add another enum entry for RoCEv1 in the patch to make this clearer. > > > IBoUDP changes the Ethertype, replaces the network header, adds a > > > new > > transport protocol header, and layers IB over that. This change > > should be exposed properly and not as just a new GID type. > > I don't understand what do you suggest here. Can you give an example? > > I don't have a solution here. Please look at Michael Wang's patch series and > see how this would fit into that model. 
The introduction of iWARP required > defining a new 'transport' type. IBoE added a new link layer. Based on those > changes, this would warrant introducing a new network layer, so that it can > be distinguished properly from the other options. Maybe that's the right > approach? > The "new-transport" in Michael's patches doesn't refer to the network transport layer, but rather acts as a summary of the tuple (the network layer is indeed skipped). The network transport layer of Infiniband and *all* RoCE types is the same: IB transport. As I said earlier, the network layer (e.g., IPv4, IPv6, or GRH) cannot be a port attribute because RoCE HCAs support all of them over the same physical port. Maybe we should change these patches to encode the port "summary" as a bit mask, and provide convenient masks for queries? Alternatively, we could leave the port qualifiers as they are (i.e., with distinct qualifiers) instead of introducing "new-transport", but provide convenient wrappers for ULPs to use. For example: - is_roce() /* returns true for all RoCE wire protocols */ - is_rocev1() - is_rocev2() - is_iwarp() - ... > Cisco's NIC reports a transport layer of 'USNIC_UDP', which should really just > be 'UDP'. That NIC supports UDP/IP/Ethernet, based on the rdma stack's > model. > RoCE v2 is also UDP/IP/Ethernet, it only layers IB over that. (This makes the > use of the term 'transport' confusing. Maybe there should also be a 'session' > protocol?) It seems completely reasonable for a device which does > IB/UDP/IP/Ethernet to someday expose the UDP/IP/Ethernet portion (if it > doesn't already), and from the same port at the same time. RoCEv2 uses UDP only as an encapsulation protocol (just like VXLAN). The transport is well-defined: IBTA transport (just like Infiniband). In the USNIC case, UDP is indeed the actual network transport.
> > Rather than continuing to try to make everything look like an IB-classic > device because it's convenient, the stack needs to start exposing things properly. I > don't know what the right solution should be, but trying to capture this level > of detail as a different GID type definitely looks like a step in the wrong > direction. As mentioned above, this is how Verbs consumers send traffic using the different wire-protocols over the same physical port. This patch-set follows the specification, and cleanly provides all per-GID associations in 'struct ib_gid_attr'. So basically, in RoCE, a GID entry represents a network interface (a netdev) to the HCA, and encompasses all information related to that interface: MAC, VLAN, MTU, IP address, namespace, etc. This also makes it straightforward to add future extensions. > > - Sean
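The wrapper helpers proposed earlier in the thread could sit on top of a port "summary" bit mask. A hedged sketch with an entirely made-up mask layout follows; none of these enum or function names exist in the kernel, they only illustrate the suggested API shape:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port protocol mask; one bit per supported wire
 * protocol, so a single port can advertise several at once. */
enum port_proto {
    PORT_PROTO_IB      = 1 << 0,
    PORT_PROTO_ROCE_V1 = 1 << 1,
    PORT_PROTO_ROCE_V2 = 1 << 2,
    PORT_PROTO_IWARP   = 1 << 3,
};

static bool is_rocev1(unsigned int mask) { return mask & PORT_PROTO_ROCE_V1; }
static bool is_rocev2(unsigned int mask) { return mask & PORT_PROTO_ROCE_V2; }

/* True for all RoCE wire protocols, as proposed in the thread. */
static bool is_roce(unsigned int mask)
{
    return mask & (PORT_PROTO_ROCE_V1 | PORT_PROTO_ROCE_V2);
}

static bool is_iwarp(unsigned int mask) { return mask & PORT_PROTO_IWARP; }
```

A bit mask like this captures Matan's point that one physical RoCE port speaks several wire protocols simultaneously, which a single "transport per port" enum cannot express.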
RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> > RoCE v2 is really Infiniband over UDP over IP. Why don't we just call > it IBoUDP like it is? > RoCEv2 is the name in the IBTA spec (Annex 17) We call RoCE IBoE in the kernel, because that's what it is. RoCE is an IBTA marketing name. Looking through the Annex, I don't see where Ethernet is even a requirement for this technology to work. The IB transport is layered over a standard UDP header. I do see where the spec calls out updating the IP header, but that's it. Regardless of what it's called, it replaces the underlying network and transport protocols, versus IB-classic or IBoE/RoCE. That should be captured properly, not by saying there's a new GID type. RoCE v2 doesn't even use GIDs as part of its protocol. It uses UDP/IP addresses. > > IBoUDP changes the Ethertype, replaces the network header, adds a new > transport protocol header, and layers IB over that. This change should be > exposed properly and not as just a new GID type. > I don't understand what you suggest here. Can you give an example? I don't have a solution here. Please look at Michael Wang's patch series and see how this would fit into that model. The introduction of iWARP required defining a new 'transport' type. IBoE added a new link layer. Based on those changes, this would warrant introducing a new network layer, so that it can be distinguished properly from the other options. Maybe that's the right approach? Cisco's NIC reports a transport layer of 'USNIC_UDP', which should really just be 'UDP'. That NIC supports UDP/IP/Ethernet, based on the rdma stack's model. RoCE v2 is also UDP/IP/Ethernet, it only layers IB over that. (This makes the use of the term 'transport' confusing. Maybe there should also be a 'session' protocol?) It seems completely reasonable for a device which does IB/UDP/IP/Ethernet to someday expose the UDP/IP/Ethernet portion (if it doesn't already), and from the same port at the same time.
Rather than continuing to try to make everything look like an IB-classic device because it's convenient, the stack needs to start exposing things properly. I don't know what the right solution should be, but trying to capture this level of detail as a different GID type definitely looks like a step in the wrong direction. - Sean
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On Wed, Apr 15, 2015 at 7:08 PM, Hefty, Sean wrote: >> It does not break every app, the choice of which GID type to use is made >> by the RDMA-CM based on network topology hint obtained from the IP stack. >> Please refer to patch 15/33: IB/Core: Changes to the IB Core >> infrastructure for RoCEv2 support. >> Of course, if the user does not want to go with this choice made by the >> RDMA-CM, then there is the option of overriding it using the configfs >> patch (PATCH 14/33) >> Hope that clarifies? > > RoCE v2 is really Infiniband over UDP over IP. Why don't we just call it > IBoUDP like it is? RoCEv2 is the name in the IBTA spec (Annex 17) > > IBoUDP changes the Ethertype, replaces the network header, adds a new > transport protocol header, and layers IB over that. This change should be > exposed properly and not as just a new GID type. I don't understand what you suggest here. Can you give an example?
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
AFAIK, RoCE v2 is the known and official name. Why would we want to come up with a customized name? These are indeed two different protocols, thus the comparison to DHCP-assigned addresses and static addresses is (to say the least) a bit off. Even when comparing IPv4 and IPv6, the most significant user change is the sockaddr_in and sockaddr_in6 addresses. IMHO, since the GID format is identical, the changes could be encoded in a gid_type argument. The gid_type is an inherent part of the address, which makes identical GIDs with different gid_types two different addresses, as expected. On 4/15/2015 7:21 PM, Suri Shelvapille wrote: IMHO, it would be good to have a physical layer representation in the naming convention. -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Hefty, Sean Sent: Wednesday, April 15, 2015 12:08 PM To: Somnath Kotur; Matan Barak; rol...@kernel.org Cc: linux-rdma@vger.kernel.org Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache It does not break every app, the choice of which GID type to use is made by the RDMA-CM based on network topology hint obtained from the IP stack. Please refer to patch 15/33: IB/Core: Changes to the IB Core infrastructure for RoCEv2 support. Of course, if the user does not want to go with this choice made by the RDMA-CM, then there is the option of overriding it using the configfs patch (PATCH 14/33) Hope that clarifies? RoCE v2 is really Infiniband over UDP over IP. Why don't we just call it IBoUDP like it is? IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that. This change should be exposed properly and not as just a new GID type.
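Matan's point that identical GIDs with different gid_types form two different addresses implies that any table lookup has to match on the (GID, type) pair. A hypothetical illustration follows; the table layout and helper function are invented for this sketch, and only the enum values are taken from the patch:

```c
#include <assert.h>
#include <string.h>

enum ib_gid_type {          /* values as quoted from the patch */
    IB_GID_TYPE_IB = 0,
    IB_GID_TYPE_ROCE_V2 = 1,
    IB_GID_TYPE_SIZE
};

struct gid_row {            /* made-up table row for illustration */
    enum ib_gid_type type;
    unsigned char gid[16];
};

/* Find a table index matching both the GID bytes and the type; the
 * same bytes with a different type count as a different address. */
static int find_gid_index(const struct gid_row *tbl, int n,
                          enum ib_gid_type type, const unsigned char gid[16])
{
    for (int i = 0; i < n; i++)
        if (tbl[i].type == type && !memcmp(tbl[i].gid, gid, 16))
            return i;
    return -1;
}
```

Under this model a RoCEv1 and a RoCEv2 entry can carry byte-identical GIDs and still resolve to distinct table indices, which is exactly the semantic Matan describes.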
RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
IMHO, it would be good to have a physical layer representation in the naming convention. -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Hefty, Sean Sent: Wednesday, April 15, 2015 12:08 PM To: Somnath Kotur; Matan Barak; rol...@kernel.org Cc: linux-rdma@vger.kernel.org Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache > It does not break every app, the choice of which GID type to use is > made by the RDMA-CM based on network topology hint obtained from the IP stack. > Please refer to patch 15/33: IB/Core: Changes to the IB Core > infrastructure for RoCEv2 support. > Of course, if the user does not want to go with this choice made by > the RDMA-CM, then there is the option of overriding it using the > configfs patch (PATCH 14/33) Hope that clarifies? RoCE v2 is really Infiniband over UDP over IP. Why don't we just call it IBoUDP like it is? IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that. This change should be exposed properly and not as just a new GID type.
RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> It does not break every app, the choice of which GID type to use is made > by the RDMA-CM based on network topology hint obtained from the IP stack. > Please refer to patch 15/33: IB/Core: Changes to the IB Core > infrastructure for RoCEv2 support. > Of course, if the user does not want to go with this choice made by the > RDMA-CM, then there is the option of overriding it using the configfs > patch (PATCH 14/33) > Hope that clarifies? RoCE v2 is really Infiniband over UDP over IP. Why don't we just call it IBoUDP like it is? IBoUDP changes the Ethertype, replaces the network header, adds a new transport protocol header, and layers IB over that. This change should be exposed properly and not as just a new GID type.
RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> -Original Message- > From: Hefty, Sean [mailto:sean.he...@intel.com] > Sent: Tuesday, April 14, 2015 11:02 PM > To: Matan Barak; Somnath Kotur; rol...@kernel.org > Cc: linux-rdma@vger.kernel.org > Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache > > > This is a part of the GID meta info. The user should be able to choose > > between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and > RoCE > > V2 - just as a user could choose between IPv6 and IPv4. > > IPv4 and IPv6 are different protocols, not different formats for the same > address. How does RoCE v2 not break every app? It does not break every app, the choice of which GID type to use is made by the RDMA-CM based on network topology hint obtained from the IP stack. Please refer to patch 15/33: IB/Core: Changes to the IB Core infrastructure for RoCEv2 support. Of course, if the user does not want to go with this choice made by the RDMA-CM, then there is the option of overriding it using the configfs patch (PATCH 14/33) Hope that clarifies? Thanks Som
RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> This is a part of the GID meta info. The user should be able to choose > between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and RoCE > V2 - just as a user could choose between IPv6 and IPv4. IPv4 and IPv6 are different protocols, not different formats for the same address. How does RoCE v2 not break every app? This isn't like asking the user to choose between IPv4 versus IPv6, it's asking them to choose between IPv4 assigned by DHCP versus IPv4 assigned statically.
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On 04/14/15 15:23, Matan Barak wrote: +mutex_lock(&cache->lock); + +for (ix = 0; ix < cache->sz; ix++) +if (cache->data_vec[ix].attr.ndev == ndev) +write_gid(ib_dev, port, cache, ix, &zgid, &zattr); + +mutex_unlock(&cache->lock); +return 0; The traditional Linux kernel coding style is one blank line before mutex_lock() and after mutex_unlock() but not after mutex_lock() nor before mutex_unlock(). I didn't find this in the CodingStyle doc. Could you please quote or post a link? Hello Matan, I'm not aware of any formal documentation of this style guideline. But if you look around in the Linux kernel tree you will see that most kernel code follows this style. Bart.
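The convention Bart describes is easy to show with a small userspace analogue, using a pthread mutex as a stand-in for the kernel's mutex API (the function itself is purely illustrative):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int counter;

static int counter_inc(void)
{
    int val;

    pthread_mutex_lock(&lock);
    val = ++counter;            /* no blank line right after lock() */
    pthread_mutex_unlock(&lock);

    return val;                 /* blank line after unlock() */
}
```

The blank lines visually separate the setup and the result from the critical section, while the locked region itself reads as one tight block.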
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On 3/26/2015 1:42 AM, Bart Van Assche wrote: On 03/25/2015 02:19 PM, Somnath Kotur wrote: +if (cache->data_vec[ix].attr.ndev && +cache->data_vec[ix].attr.ndev != old_net_dev) A few lines earlier the memory old_net_dev points at was freed. If two instances of this function run concurrently, what prevents that the old_net_dev memory has been reallocated and hence that attr.ndev == old_net_dev although both pointers refer(red) to different network devices ? write_gid is *almost* always called in a mutex. The only case it's not protected is in free_roce_gid_cache. free_roce_gid_cache is called only in the error flow of roce_gid_cache_setup_one, when no concurrent write_gid could be made (as the cache isn't setup yet) and in roce_gid_cache_cleanup_one. roce_gid_cache_client_setup_one is called in the error flow of roce_gid_cache_client_setup_one (where no other write_gid are expected for the same above reason) and in roce_gid_cache_client_cleanup_work_handler, where it's called after flush_workqueue(roce_gid_mgmt_wq). Since all write_gids are called through roce_gid_mgmt_wq and we set the cache to inactive mode before flushing the wq and freeing the cache, I think we can conclude no concurrent write_gid calls on the same cache are possible. +ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq; Invoking write_gid() is only safe if the caller serializes write_gid() calls. Apparently the cache->lock mutex is used for that purpose. So why is it necessary to use ACCESS_ONCE() here ? Why is it needed to prevent that the compiler coalesces this write with another write into the same structure ? The mutex only serializes cache writes. Cache reads can be done concurrently with writes and are protected by the ACCESS_ONCE. +/* Make sure the sequence number we remeber was read This looks like a typo - shouldn't the above read "remember" ? Will be fixed in V4, thanks. BTW, the style of that comment is recommended only for networking code and not for IB code.
Have you verified this patch with checkpatch ? Of course and I've just re-run checkpatch on this patch. It doesn't catch this. +mutex_lock(&cache->lock); + +for (ix = 0; ix < cache->sz; ix++) +if (cache->data_vec[ix].attr.ndev == ndev) +write_gid(ib_dev, port, cache, ix, &zgid, &zattr); + +mutex_unlock(&cache->lock); +return 0; The traditional Linux kernel coding style is one blank line before mutex_lock() and after mutex_unlock() but not after mutex_lock() nor before mutex_unlock(). I didn't find this in the CodingStyle doc. Could you please quote or post a link? +orig_seq = ACCESS_ONCE(cache->data_vec[index].seq); +/* Make sure we read the sequence number before copying the + * gid to local storage. */ +smp_rmb(); Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in . Ok, I'll change that in V4. I see that READ_ONCE and WRITE_ONCE are different from ACCESS_ONCE only for aggregated data types (which isn't our case), but it won't hurt to change that. +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port) +{ +int i; +struct ib_roce_gid_cache *cache = +ib_dev->cache.roce_gid_cache[port - 1]; + +if (!cache) +return; + +for (i = 0; i < cache->sz; ++i) { +if (memcmp(&cache->data_vec[i].gid, &zgid, + sizeof(cache->data_vec[i].gid))) +write_gid(ib_dev, port, cache, i, &zgid, &zattr); +} > +kfree(cache->data_vec); > +kfree(cache); > +} Overwriting data just before it is freed is not useful. Please use CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such code. It's mandatory as write_gid with zgid might cause the vendor driver to free memory it allocated for this GID entry (like in the mlx4 case). Bart. Thanks for the review :) Matan
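The read side being debated (read the sequence, barrier, copy the GID, barrier, re-check the sequence) can be approximated in userspace C11, with relaxed atomic loads standing in for READ_ONCE() and acquire fences as a rough analogue of smp_rmb(). This is a model of the pattern, not the kernel code, and the C11/kernel memory-model correspondence is only approximate:

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

struct cache_entry {
    _Atomic unsigned int seq;
    char gid[16];
};

/* Returns 0 and copies a consistent GID, or -1 if a writer raced us
 * (the sequence number changed across the copy). */
static int entry_read_gid(struct cache_entry *e, char out[16])
{
    unsigned int orig_seq;

    orig_seq = atomic_load_explicit(&e->seq, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);   /* ~ smp_rmb() */

    memcpy(out, e->gid, sizeof(e->gid));

    atomic_thread_fence(memory_order_acquire);   /* ~ smp_rmb() */
    if (atomic_load_explicit(&e->seq, memory_order_relaxed) != orig_seq)
        return -1;                               /* retry or give up */
    return 0;
}
```

The fences are what make the seq/payload ordering meaningful to a concurrent reader, which is why the mutex alone (serializing only writers) is not enough on the read path.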
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On 4/14/2015 2:50 AM, Hefty, Sean wrote:
>> Yes, this comment probably could use a reword..
>>>> +	IB_GID_TYPE_IB		= 0,
>>>> +	IB_GID_TYPE_ROCE_V2	= 1,
>>>> +	IB_GID_TYPE_SIZE
>>>> +};
>>> Can you explain the purpose of defining a 'GID type'. A GID is just a global address. Why does it matter to anyone using it how it was constructed?
>> This is part of the RoCE V2 Specification. Please refer to Section A 17.8. The GID Type determines the protocol for outbound packet generation, i.e. RoCE V1 (0x8915 Ether Type) or RoCE V2 (IPv4 or IPv6).
> This isn't an interface for the RoCE specification. Why does this need to be added to the verbs interface? It hasn't been needed by apps yet, and I don't see why the apps should be made to care now how the GID is formatted.

This is a part of the GID meta info. The user should be able to choose between RoCE V1 (which is represented here by IB_GID_TYPE_IB) and RoCE V2 - just as a user could choose between IPv6 and IPv4.

>>>> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>>>> 	IB_PORT_BOOT_MGMT_SUP		= 1 << 23,
>>>> 	IB_PORT_LINK_LATENCY_SUP	= 1 << 24,
>>>> 	IB_PORT_CLIENT_REG_SUP		= 1 << 25,
>>>> -	IB_PORT_IP_BASED_GIDS		= 1 << 26
>>>> +	IB_PORT_IP_BASED_GIDS		= 1 << 26,
>>>> +	IB_PORT_ROCE			= 1 << 27,
>>>> +	IB_PORT_ROCE_V2			= 1 << 28,
>>> Why does RoCE suddenly require a port capability bit? RoCE runs today without setting any bit.
>> Again, this is part of the RoCE V2 SPEC; please refer to Section A17.5.1 - Query HCA (pasting snippet below): A new "RoCE Supported" capability bit shall be added to the Port Attributes list. This capability bit applies exclusively to ports of the new "RoCEv2" type.
> Same comment as above.
RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> Yes, this comment probably could use a reword..
> > > +	IB_GID_TYPE_IB		= 0,
> > > +	IB_GID_TYPE_ROCE_V2	= 1,
> > > +	IB_GID_TYPE_SIZE
> > > +};
> >
> > Can you explain the purpose of defining a 'GID type'. A GID is just a
> > global address. Why does it matter to anyone using it how it was constructed?
>
> This is part of RoCE V2 Specification. Please refer to Section A 17.8 .
> The GID Type determines the protocol for outbound packet generation i.e
> RoCE V1 (0x8915 Ether Type) or RoCEV2 (IPv4 or IPv6)

This isn't an interface for the RoCE specification. Why does this need to be added to the verbs interface? It hasn't been needed by apps yet, and I don't see why the apps should be made to care now how the GID is formatted.

> > > @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
> > > 	IB_PORT_BOOT_MGMT_SUP		= 1 << 23,
> > > 	IB_PORT_LINK_LATENCY_SUP	= 1 << 24,
> > > 	IB_PORT_CLIENT_REG_SUP		= 1 << 25,
> > > -	IB_PORT_IP_BASED_GIDS		= 1 << 26
> > > +	IB_PORT_IP_BASED_GIDS		= 1 << 26,
> > > +	IB_PORT_ROCE			= 1 << 27,
> > > +	IB_PORT_ROCE_V2		= 1 << 28,
> >
> > Why does RoCE suddenly require a port capability bit? RoCE runs today
> > without setting any bit.
>
> Again, this is part of RoCE V2 SPEC, please refer to Section A17.5.1 -
> Query HCA (Pasting snippet below)
> A new "RoCE Supported" capability bit shall be added to the Port Attributes
> list. This capability bit applies exclusively to ports of the new
> "RoCEv2" type

Same comment as above.
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On Wed, Apr 8, 2015 at 2:30 AM, Hefty, Sean wrote:
>> In order to manage multiple types, vlans and MACs per GID, we
>> need to store them along the GID itself. We store the net device
>> as well, as sometimes GIDs should be handled according to the
>> net device they came from. Since populating the GID table should
>> be identical for every RoCE provider, the GIDs table should be
>> handled in ib_core.
>>
>> Adding a GID cache table that supports a lockless find, add and
>> delete gids. The lockless nature comes from using a unique
>> sequence number per table entry and detecting that while reading/
>> writing this sequence wasn't changed.
>>
>> By using this RoCE GID cache table, providers must implement a
>> modify_gid callback. The table is managed exclusively by
>> this roce_gid_cache and the provider just need to write
>> the data to the hardware.
>>
>> Signed-off-by: Matan Barak
>> Signed-off-by: Somnath Kotur
>> ---
>>  drivers/infiniband/core/Makefile         |   3 +-
>>  drivers/infiniband/core/core_priv.h      |  24 ++
>>  drivers/infiniband/core/roce_gid_cache.c | 518
>
> Why does RoCE need such a complex gid cache? If a gid cache is needed at
> all, why should it be restricted to RoCE only? And why is such a complex
> synchronization scheme needed? Seriously, how many times will GIDs change
> and how many readers at once do you expect to have?

A GID cache is also implemented for the IB link layer. However, for RoCE the GID cache is also the manager of the table. This means that adding or removing entries from the GID table is the responsibility of the cache, not of the HW/device driver. This is a new scheme that frees each vendor's driver from dealing with net and inet events. The content of the GID table is much more dynamic for RoCE than for IB, and so is access to the table, so I guess that extra mechanism is required. The fact that a GID entry is associated with net_device and inet_addr objects that can be modified/deleted at any time is an example.
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 65994a1..1866595 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -64,6 +64,36 @@ union ib_gid {
>>  	} global;
>>  };
>>
>> +extern union ib_gid zgid;
>> +
>> +enum ib_gid_type {
>> +	/* If link layer is Ethernet, this is RoCE V1 */
>
> I don't understand this comment. Does RoCE v2 not run on Ethernet?
>
>> +	IB_GID_TYPE_IB		= 0,
>> +	IB_GID_TYPE_ROCE_V2	= 1,
>> +	IB_GID_TYPE_SIZE
>> +};
>
> Can you explain the purpose of defining a 'GID type'. A GID is just a global
> address. Why does it matter to anyone using it how it was constructed?
>
>> +
>> +struct ib_gid_attr {
>> +	enum ib_gid_type	gid_type;
>> +	struct net_device	*ndev;
>> +};
>> +
>> +struct ib_roce_gid_cache_entry {
>> +	/* seq number of 0 indicates entry being changed. */
>> +	unsigned int		seq;
>> +	union ib_gid		gid;
>> +	struct ib_gid_attr	attr;
>> +	void			*context;
>> +};
>> +
>> +struct ib_roce_gid_cache {
>> +	int			active;
>> +	int			sz;
>> +	/* locking against multiple writes in data_vec */
>> +	struct mutex		lock;
>> +	struct ib_roce_gid_cache_entry *data_vec;
>> +};
>> +
>>  enum rdma_node_type {
>>  	/* IB values map to NodeInfo:NodeType. */
>>  	RDMA_NODE_IB_CA	= 1,
>> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>>  	IB_PORT_BOOT_MGMT_SUP		= 1 << 23,
>>  	IB_PORT_LINK_LATENCY_SUP	= 1 << 24,
>>  	IB_PORT_CLIENT_REG_SUP		= 1 << 25,
>> -	IB_PORT_IP_BASED_GIDS		= 1 << 26
>> +	IB_PORT_IP_BASED_GIDS		= 1 << 26,
>> +	IB_PORT_ROCE			= 1 << 27,
>> +	IB_PORT_ROCE_V2		= 1 << 28,
>
> Why does RoCE suddenly require a port capability bit? RoCE runs today
> without setting any bit.
RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
Hi Sean,

> -----Original Message-----
> From: Hefty, Sean [mailto:sean.he...@intel.com]
> Sent: Wednesday, April 08, 2015 6:00 AM
> To: Somnath Kotur; rol...@kernel.org
> Cc: linux-rdma@vger.kernel.org; Matan Barak
> Subject: RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
>
> > In order to manage multiple types, vlans and MACs per GID, we need to
> > store them along the GID itself. We store the net device as well, as
> > sometimes GIDs should be handled according to the net device they came
> > from. Since populating the GID table should be identical for every
> > RoCE provider, the GIDs table should be handled in ib_core.
> >
> > Adding a GID cache table that supports a lockless find, add and delete
> > gids. The lockless nature comes from using a unique sequence number
> > per table entry and detecting that while reading/ writing this
> > sequence wasn't changed.
> >
> > By using this RoCE GID cache table, providers must implement a
> > modify_gid callback. The table is managed exclusively by this
> > roce_gid_cache and the provider just need to write the data to the
> > hardware.
> >
> > Signed-off-by: Matan Barak
> > Signed-off-by: Somnath Kotur
> > ---
> >  drivers/infiniband/core/Makefile         |   3 +-
> >  drivers/infiniband/core/core_priv.h      |  24 ++
> >  drivers/infiniband/core/roce_gid_cache.c | 518
>
> Why does RoCE need such a complex gid cache? If a gid cache is needed at
> all, why should it be restricted to RoCE only? And why is such a complex
> synchronization scheme needed? Seriously, how many times will GIDs
> change and how many readers at once do you expect to have?
>
> > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> > index 65994a1..1866595 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -64,6 +64,36 @@ union ib_gid {
> >  	} global;
> >  };
> >
> > +extern union ib_gid zgid;
> > +
> > +enum ib_gid_type {
> > +	/* If link layer is Ethernet, this is RoCE V1 */
>
> I don't understand this comment.
> Does RoCE v2 not run on Ethernet?

Yes, this comment probably could use a reword..

> > +	IB_GID_TYPE_IB		= 0,
> > +	IB_GID_TYPE_ROCE_V2	= 1,
> > +	IB_GID_TYPE_SIZE
> > +};
>
> Can you explain the purpose of defining a 'GID type'. A GID is just a global
> address. Why does it matter to anyone using it how it was constructed?

This is part of RoCE V2 Specification. Please refer to Section A 17.8 .
The GID Type determines the protocol for outbound packet generation i.e
RoCE V1 (0x8915 Ether Type) or RoCEV2 (IPv4 or IPv6)

> > +
> > +struct ib_gid_attr {
> > +	enum ib_gid_type	gid_type;
> > +	struct net_device	*ndev;
> > +};
> > +
> > +struct ib_roce_gid_cache_entry {
> > +	/* seq number of 0 indicates entry being changed. */
> > +	unsigned int		seq;
> > +	union ib_gid		gid;
> > +	struct ib_gid_attr	attr;
> > +	void			*context;
> > +};
> > +
> > +struct ib_roce_gid_cache {
> > +	int			active;
> > +	int			sz;
> > +	/* locking against multiple writes in data_vec */
> > +	struct mutex		lock;
> > +	struct ib_roce_gid_cache_entry *data_vec;
> > +};
> > +
> >  enum rdma_node_type {
> >  	/* IB values map to NodeInfo:NodeType. */
> >  	RDMA_NODE_IB_CA	= 1,
> > @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
> >  	IB_PORT_BOOT_MGMT_SUP		= 1 << 23,
> >  	IB_PORT_LINK_LATENCY_SUP	= 1 << 24,
> >  	IB_PORT_CLIENT_REG_SUP		= 1 << 25,
> > -	IB_PORT_IP_BASED_GIDS		= 1 << 26
> > +	IB_PORT_IP_BASED_GIDS		= 1 << 26,
> > +	IB_PORT_ROCE			= 1 << 27,
> > +	IB_PORT_ROCE_V2		= 1 << 28,
>
> Why does RoCE suddenly require a port capability bit? RoCE runs today
> without setting any bit.

Again, this is part of RoCE V2 SPEC, please refer to Section A17.5.1 -
Query HCA (Pasting snippet below):
A new "RoCE Supported" capability bit shall be added to the Port Attributes
list.
This capability bit applies exclusively to ports of the new "RoCEv2" type.

Thanks
Som
RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
> In order to manage multiple types, vlans and MACs per GID, we
> need to store them along the GID itself. We store the net device
> as well, as sometimes GIDs should be handled according to the
> net device they came from. Since populating the GID table should
> be identical for every RoCE provider, the GIDs table should be
> handled in ib_core.
>
> Adding a GID cache table that supports a lockless find, add and
> delete gids. The lockless nature comes from using a unique
> sequence number per table entry and detecting that while reading/
> writing this sequence wasn't changed.
>
> By using this RoCE GID cache table, providers must implement a
> modify_gid callback. The table is managed exclusively by
> this roce_gid_cache and the provider just need to write
> the data to the hardware.
>
> Signed-off-by: Matan Barak
> Signed-off-by: Somnath Kotur
> ---
>  drivers/infiniband/core/Makefile         |   3 +-
>  drivers/infiniband/core/core_priv.h      |  24 ++
>  drivers/infiniband/core/roce_gid_cache.c | 518

Why does RoCE need such a complex gid cache? If a gid cache is needed at all, why should it be restricted to RoCE only? And why is such a complex synchronization scheme needed? Seriously, how many times will GIDs change and how many readers at once do you expect to have?

> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 65994a1..1866595 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -64,6 +64,36 @@ union ib_gid {
>  	} global;
>  };
>
> +extern union ib_gid zgid;
> +
> +enum ib_gid_type {
> +	/* If link layer is Ethernet, this is RoCE V1 */

I don't understand this comment. Does RoCE v2 not run on Ethernet?

> +	IB_GID_TYPE_IB		= 0,
> +	IB_GID_TYPE_ROCE_V2	= 1,
> +	IB_GID_TYPE_SIZE
> +};

Can you explain the purpose of defining a 'GID type'. A GID is just a global address. Why does it matter to anyone using it how it was constructed?
> +
> +struct ib_gid_attr {
> +	enum ib_gid_type	gid_type;
> +	struct net_device	*ndev;
> +};
> +
> +struct ib_roce_gid_cache_entry {
> +	/* seq number of 0 indicates entry being changed. */
> +	unsigned int		seq;
> +	union ib_gid		gid;
> +	struct ib_gid_attr	attr;
> +	void			*context;
> +};
> +
> +struct ib_roce_gid_cache {
> +	int			active;
> +	int			sz;
> +	/* locking against multiple writes in data_vec */
> +	struct mutex		lock;
> +	struct ib_roce_gid_cache_entry *data_vec;
> +};
> +
>  enum rdma_node_type {
>  	/* IB values map to NodeInfo:NodeType. */
>  	RDMA_NODE_IB_CA	= 1,
> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>  	IB_PORT_BOOT_MGMT_SUP		= 1 << 23,
>  	IB_PORT_LINK_LATENCY_SUP	= 1 << 24,
>  	IB_PORT_CLIENT_REG_SUP		= 1 << 25,
> -	IB_PORT_IP_BASED_GIDS		= 1 << 26
> +	IB_PORT_IP_BASED_GIDS		= 1 << 26,
> +	IB_PORT_ROCE			= 1 << 27,
> +	IB_PORT_ROCE_V2		= 1 << 28,

Why does RoCE suddenly require a port capability bit? RoCE runs today without setting any bit.
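For context on the capability-bit question: these values are just flags in a port's capability mask, so a consumer tests them with a bitwise AND. A minimal illustration (the enum values are copied from the quoted hunk; the helper name and the standalone enum are invented for the example, not part of the patch):

```c
#include <stdbool.h>

/* Values copied from the quoted ib_port_cap_flags hunk. */
enum ib_port_cap_flags {
	IB_PORT_IP_BASED_GIDS	= 1 << 26,
	IB_PORT_ROCE		= 1 << 27,
	IB_PORT_ROCE_V2		= 1 << 28,
};

/* Hypothetical helper: does this port's cap mask advertise RoCE v2? */
static bool port_supports_roce_v2(unsigned int port_cap_flags)
{
	return (port_cap_flags & IB_PORT_ROCE_V2) != 0;
}
```

Sean's objection is precisely that existing RoCE ports report no such bit, so consumers today infer RoCE support from the link layer rather than from a flag like the one tested above.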
RE: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
Hi Matan/Moni,
Could either of you please respond to both of Bart's queries?
Thanks
Somnath

> -----Original Message-----
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Thursday, March 26, 2015 5:13 AM
> To: Somnath Kotur; rol...@kernel.org
> Cc: linux-rdma@vger.kernel.org; Matan Barak
> Subject: Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
>
> On 03/25/2015 02:19 PM, Somnath Kotur wrote:
> > +	if (cache->data_vec[ix].attr.ndev &&
> > +	    cache->data_vec[ix].attr.ndev != old_net_dev)
>
> A few lines earlier the memory old_net_dev points at was freed. If two
> instances of this function run concurrently, what prevents that the
> old_net_dev memory has been reallocated and hence that attr.ndev ==
> old_net_dev although both pointers refer(red) to different network devices?
>
> > +	ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;
>
> Invoking write_gid() is only safe if the caller serializes write_gid() calls.
> Apparently the cache->lock mutex is used for that purpose. So why is it
> necessary to use ACCESS_ONCE() here? Why is it needed to prevent that
> the compiler coalesces this write with another write into the same structure?
>
> > +	/* Make sure the sequence number we remeber was read
>
> This looks like a typo - shouldn't the above read "remember"?
>
> BTW, the style of that comment is recommended only for networking code
> and not for IB code. Have you verified this patch with checkpatch?
>
> > +	mutex_lock(&cache->lock);
> > +
> > +	for (ix = 0; ix < cache->sz; ix++)
> > +		if (cache->data_vec[ix].attr.ndev == ndev)
> > +			write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
> > +
> > +	mutex_unlock(&cache->lock);
> > +	return 0;
>
> The traditional Linux kernel coding style is one blank line before
> mutex_lock() and after mutex_unlock() but not after mutex_lock() nor
> before mutex_unlock().
> > +	orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
> > +	/* Make sure we read the sequence number before copying the
> > +	 * gid to local storage. */
> > +	smp_rmb();
>
> Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in .
>
> > +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
> > +{
> > +	int i;
> > +	struct ib_roce_gid_cache *cache =
> > +		ib_dev->cache.roce_gid_cache[port - 1];
> > +
> > +	if (!cache)
> > +		return;
> > +
> > +	for (i = 0; i < cache->sz; ++i) {
> > +		if (memcmp(&cache->data_vec[i].gid, &zgid,
> > +			   sizeof(cache->data_vec[i].gid)))
> > +			write_gid(ib_dev, port, cache, i, &zgid, &zattr);
> > +	}
> > +	kfree(cache->data_vec);
> > +	kfree(cache);
> > +}
>
> Overwriting data just before it is freed is not useful. Please use
> CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such code.
>
> Bart.
Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache
On 03/25/2015 02:19 PM, Somnath Kotur wrote:
> +	if (cache->data_vec[ix].attr.ndev &&
> +	    cache->data_vec[ix].attr.ndev != old_net_dev)

A few lines earlier the memory old_net_dev points at was freed. If two instances of this function run concurrently, what prevents that the old_net_dev memory has been reallocated and hence that attr.ndev == old_net_dev although both pointers refer(red) to different network devices?

> +	ACCESS_ONCE(cache->data_vec[ix].seq) = orig_seq;

Invoking write_gid() is only safe if the caller serializes write_gid() calls. Apparently the cache->lock mutex is used for that purpose. So why is it necessary to use ACCESS_ONCE() here? Why is it needed to prevent that the compiler coalesces this write with another write into the same structure?

> +	/* Make sure the sequence number we remeber was read

This looks like a typo - shouldn't the above read "remember"?

BTW, the style of that comment is recommended only for networking code and not for IB code. Have you verified this patch with checkpatch?

> +	mutex_lock(&cache->lock);
> +
> +	for (ix = 0; ix < cache->sz; ix++)
> +		if (cache->data_vec[ix].attr.ndev == ndev)
> +			write_gid(ib_dev, port, cache, ix, &zgid, &zattr);
> +
> +	mutex_unlock(&cache->lock);
> +	return 0;

The traditional Linux kernel coding style is one blank line before mutex_lock() and after mutex_unlock() but not after mutex_lock() nor before mutex_unlock().

> +	orig_seq = ACCESS_ONCE(cache->data_vec[index].seq);
> +	/* Make sure we read the sequence number before copying the
> +	 * gid to local storage. */
> +	smp_rmb();

Please use READ_ONCE() instead of ACCESS_ONCE() as recommended in .
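The sequence-number read protocol Bart is reviewing can be modeled in portable C11 atomics. The sketch below is a hedged userspace analog, not the patch's code: the names are invented, and the C11 fences only approximate the kernel's smp_rmb()/smp_wmb(). It shows the idea the patch description states: seq == 0 marks an entry in flux, and a reader retries until it sees the same non-zero sequence before and after its copy.

```c
#include <stdatomic.h>
#include <string.h>

/* Illustrative stand-in for the patch's cache entry. */
struct gid_entry {
	atomic_uint	seq;		/* 0 means "entry is being changed" */
	unsigned char	gid[16];
};

/* Writer: serialized externally (the patch uses cache->lock for this). */
static void entry_write(struct gid_entry *e, const unsigned char gid[16],
			unsigned int next_seq)
{
	atomic_store_explicit(&e->seq, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* like smp_wmb() */
	memcpy(e->gid, gid, sizeof(e->gid));
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&e->seq, next_seq, memory_order_relaxed);
}

/* Lockless reader: retry until a stable non-zero sequence brackets the copy. */
static void entry_read(struct gid_entry *e, unsigned char out[16])
{
	unsigned int seq;

	do {
		seq = atomic_load_explicit(&e->seq, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);	/* like smp_rmb() */
		memcpy(out, e->gid, sizeof(e->gid));
		atomic_thread_fence(memory_order_acquire);
	} while (seq == 0 ||
		 seq != atomic_load_explicit(&e->seq, memory_order_relaxed));
}
```

Bart's ACCESS_ONCE question maps onto the atomic loads/stores of `seq` here: the reader runs without the writer's lock, so the sequence accesses must not be torn or coalesced by the compiler even though the writers themselves are serialized.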
> +static void free_roce_gid_cache(struct ib_device *ib_dev, u8 port)
> +{
> +	int i;
> +	struct ib_roce_gid_cache *cache =
> +		ib_dev->cache.roce_gid_cache[port - 1];
> +
> +	if (!cache)
> +		return;
> +
> +	for (i = 0; i < cache->sz; ++i) {
> +		if (memcmp(&cache->data_vec[i].gid, &zgid,
> +			   sizeof(cache->data_vec[i].gid)))
> +			write_gid(ib_dev, port, cache, i, &zgid, &zattr);
> +	}
> +	kfree(cache->data_vec);
> +	kfree(cache);
> +}

Overwriting data just before it is freed is not useful. Please use CONFIG_SLUB_DEBUG=y to debug use-after-free issues instead of such code.

Bart.