RE: [PATCH 0/2] Adding support for RoCE V2 specification
Hi Moni, -Original Message- From: monisonli...@gmail.com [mailto:monisonli...@gmail.com] On Behalf Of Moni Shoua Sent: Tuesday, December 02, 2014 12:17 PM To: Somnath Kotur Cc: rol...@kernel.org; linux-rdma@vger.kernel.org Subject: Re: [PATCH 0/2] Adding support for RoCE V2 specification On Tue, Dec 2, 2014 at 7:35 AM, Somnath Kotur somnath.ko...@emulex.com wrote: HI Moni, Thank you for your comments , please find my response inline. The overarching goal behind this patch is to keep RDMA-CM as the central entity that decides the protocol (ROCEV2 /ROCEV1) and hides all the address resolution details from applications. I agree but with one comment. RDMA-CM, although preferred, is not mandatory to establish a connection. The solution needs to take care also for apps that don't use RDMA-CM Agreed and this patch takes care of that as well thanks to the solution that was put in place for IP-based GIDs anyway in `modify_qp` effectively by means of the ib_resolve_l2_attrs(). This is just a continuation of the existing RDMA-CM design philosophy for IP Based GIDs that lets applications operate over any form of RDMA Service in a completely transparent way without any changes to the applications themselves. Protocol Resolution(RoCEV2 or V1) must be done even before we send out the 1st connection request packet, the corresponding MAD pkt must be V2 for a destination that is behind a router, correct? I am not sure we want to try sending out RoCEv2 connection requests first and if that fails ,then fallback/retry with RoCEv1 request. I agree but when you send the first packet (request or response) you know the SGID index to be used. Knowing that and assuming you chose the SGID index correctly there is no need to guess. A suggestion for choosing the SGID index is: 1. First, populate GID table with GIDs matching the IP address list of the corresponding net devices. Each GID will appear twice, one for RoCEv1 and one for RoCEv2. 2. When connection is being established RDMA-CM will choose one of the 2 matching indices based on a preferred protocol attribute for the rmda_id (new attr.) 3. Proffered protocol is a) global for the node/fabric based on the administrator settings b) can be changed by applications that want to override the default (requires extra API) Agree to point 1 which is what we were thinking of doing as well as part of the 'override' option. Still feel that the 'default' protocol resolution should be based on help from the IP/Networking stack as that would straight away enable existing applications and majority of use-case scenarios that use RDMA-CM to seamlessly work without any change for RoCE V2. What this means is now RDMA-CM will choose one of the 2 matching GID indices based on help from the IP stack. For this to happen , the 'gid_cache' structure will have to undergo a change to now incorporate a new 'gid_type' field. Accordingly the helper functions that find a cached gid entry - ib_find_cached_gid() will also undergo a change to take into account this new field. I plan to resubmit the patch series making the following changes a) Populating GID table with 2 entries per IP address b) Changes to the GID cache structure along with it's helper functions c) Make sure that RDMA-CM selects/finds the corresponding GID once a network type is chosen based on help from the IP stack. The only thing missing is the ability for end-nodes connected locally inside same subnet to communicate using V2 and that could be provided as an 'override' option using the 'preferred protocol' attribute setting on a system-wide basis or on an application basis as you have proposed. This I propose as an additional patch after the above patch goes in first. Does this sound like a plan ? Thanks Som Hi Som Generally, I agree with the changes you suggested. To summarize (and make sure I got it right) your plan for V2 is to: 1. Associate gid_index with RoCE type in GID table and GID cache. Requires helper functions to search for gid_index according to GID value and type 2. Move the code that monitors IP address changes to populate GID table from device driver to IB/core. Requires each RoCE device driver to implement add_gid_entry/rem_gid_entry functions 3. Remove the RoCE type from address structure (e.g. ib_ah_attr). Leave it only in rdma_addr to keep the hint from the IP stack 4. Choose sgid_index (from all matching entries) in RDMA-CM based on hint from IP stack, user preference and admin policy Agree with all your points above , except for point# 2. That IMHO would be more of a code refactoring/cleanup patch that can be taken up later once the baseline version is checked in as it's not really related to this patch ? Also, if we do that , then we also have to take care of the IB case as well where GID is something
Re: [PATCH 0/2] Adding support for RoCE V2 specification
Agree with all your points above , except for point# 2. That IMHO would be more of a code refactoring/cleanup patch that can be taken up later once the baseline version is checked in as it's not really related to this patch ? Also, if we do that , then we also have to take care of the IB case as well where GID is something that is obtained from the port/card(query_gid) as opposed to something that is pushed down from the stack as you are now proposing for IP-based GIDs.. i.e implications are much more than what this patch intends to address. So I'd rather do this at a later point in time and not club it with V2, sound good ? Thanks Som Hi Som I think that it can be done in 2 stages. Moving GID table management to IB/core is indeed not mandatory for RoCEv2. However, you might want to consider it anyway since RoCEv2 support is going to add changes to GID table management anyhow. So, instead of modifying each device driver sperately, do it once centrally and add a plugin function in struct ib_device (e.g. ib_write_gid(union ib_gid, struct ib_gid_info *info)). We can work in parallel and implement the mlx4 function while you concentrate in IB/core and ocrdma. I leave it for you to decide how to proceed from here. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/2] Adding support for RoCE V2 specification
HI Moni, Thank you for your comments , please find my response inline. The overarching goal behind this patch is to keep RDMA-CM as the central entity that decides the protocol (ROCEV2 /ROCEV1) and hides all the address resolution details from applications. I agree but with one comment. RDMA-CM, although preferred, is not mandatory to establish a connection. The solution needs to take care also for apps that don't use RDMA-CM Agreed and this patch takes care of that as well thanks to the solution that was put in place for IP-based GIDs anyway in `modify_qp` effectively by means of the ib_resolve_l2_attrs(). This is just a continuation of the existing RDMA-CM design philosophy for IP Based GIDs that lets applications operate over any form of RDMA Service in a completely transparent way without any changes to the applications themselves. Protocol Resolution(RoCEV2 or V1) must be done even before we send out the 1st connection request packet, the corresponding MAD pkt must be V2 for a destination that is behind a router, correct? I am not sure we want to try sending out RoCEv2 connection requests first and if that fails ,then fallback/retry with RoCEv1 request. I agree but when you send the first packet (request or response) you know the SGID index to be used. Knowing that and assuming you chose the SGID index correctly there is no need to guess. A suggestion for choosing the SGID index is: 1. First, populate GID table with GIDs matching the IP address list of the corresponding net devices. Each GID will appear twice, one for RoCEv1 and one for RoCEv2. 2. When connection is being established RDMA-CM will choose one of the 2 matching indices based on a preferred protocol attribute for the rmda_id (new attr.) 3. Proffered protocol is a) global for the node/fabric based on the administrator settings b) can be changed by applications that want to override the default (requires extra API) Agree to point 1 which is what we were thinking of doing as well as part of the 'override' option. Still feel that the 'default' protocol resolution should be based on help from the IP/Networking stack as that would straight away enable existing applications and majority of use-case scenarios that use RDMA-CM to seamlessly work without any change for RoCE V2. What this means is now RDMA-CM will choose one of the 2 matching GID indices based on help from the IP stack. For this to happen , the 'gid_cache' structure will have to undergo a change to now incorporate a new 'gid_type' field. Accordingly the helper functions that find a cached gid entry - ib_find_cached_gid() will also undergo a change to take into account this new field. I plan to resubmit the patch series making the following changes a) Populating GID table with 2 entries per IP address b) Changes to the GID cache structure along with it's helper functions c) Make sure that RDMA-CM selects/finds the corresponding GID once a network type is chosen based on help from the IP stack. The only thing missing is the ability for end-nodes connected locally inside same subnet to communicate using V2 and that could be provided as an 'override' option using the 'preferred protocol' attribute setting on a system-wide basis or on an application basis as you have proposed. This I propose as an additional patch after the above patch goes in first. Does this sound like a plan ? Thanks Som N�r��yb�X��ǧv�^�){.n�+{��ٚ�{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj��!�i
Re: [PATCH 0/2] Adding support for RoCE V2 specification
On Tue, Dec 2, 2014 at 7:35 AM, Somnath Kotur somnath.ko...@emulex.com wrote: HI Moni, Thank you for your comments , please find my response inline. The overarching goal behind this patch is to keep RDMA-CM as the central entity that decides the protocol (ROCEV2 /ROCEV1) and hides all the address resolution details from applications. I agree but with one comment. RDMA-CM, although preferred, is not mandatory to establish a connection. The solution needs to take care also for apps that don't use RDMA-CM Agreed and this patch takes care of that as well thanks to the solution that was put in place for IP-based GIDs anyway in `modify_qp` effectively by means of the ib_resolve_l2_attrs(). This is just a continuation of the existing RDMA-CM design philosophy for IP Based GIDs that lets applications operate over any form of RDMA Service in a completely transparent way without any changes to the applications themselves. Protocol Resolution(RoCEV2 or V1) must be done even before we send out the 1st connection request packet, the corresponding MAD pkt must be V2 for a destination that is behind a router, correct? I am not sure we want to try sending out RoCEv2 connection requests first and if that fails ,then fallback/retry with RoCEv1 request. I agree but when you send the first packet (request or response) you know the SGID index to be used. Knowing that and assuming you chose the SGID index correctly there is no need to guess. A suggestion for choosing the SGID index is: 1. First, populate GID table with GIDs matching the IP address list of the corresponding net devices. Each GID will appear twice, one for RoCEv1 and one for RoCEv2. 2. When connection is being established RDMA-CM will choose one of the 2 matching indices based on a preferred protocol attribute for the rmda_id (new attr.) 3. Proffered protocol is a) global for the node/fabric based on the administrator settings b) can be changed by applications that want to override the default (requires extra API) Agree to point 1 which is what we were thinking of doing as well as part of the 'override' option. Still feel that the 'default' protocol resolution should be based on help from the IP/Networking stack as that would straight away enable existing applications and majority of use-case scenarios that use RDMA-CM to seamlessly work without any change for RoCE V2. What this means is now RDMA-CM will choose one of the 2 matching GID indices based on help from the IP stack. For this to happen , the 'gid_cache' structure will have to undergo a change to now incorporate a new 'gid_type' field. Accordingly the helper functions that find a cached gid entry - ib_find_cached_gid() will also undergo a change to take into account this new field. I plan to resubmit the patch series making the following changes a) Populating GID table with 2 entries per IP address b) Changes to the GID cache structure along with it's helper functions c) Make sure that RDMA-CM selects/finds the corresponding GID once a network type is chosen based on help from the IP stack. The only thing missing is the ability for end-nodes connected locally inside same subnet to communicate using V2 and that could be provided as an 'override' option using the 'preferred protocol' attribute setting on a system-wide basis or on an application basis as you have proposed. This I propose as an additional patch after the above patch goes in first. Does this sound like a plan ? Thanks Som Hi Som Generally, I agree with the changes you suggested. To summarize (and make sure I got it right) your plan for V2 is to: 1. Associate gid_index with RoCE type in GID table and GID cache. Requires helper functions to search for gid_index according to GID value and type 2. Move the code that monitors IP address changes to populate GID table from device driver to IB/core. Requires each RoCE device driver to implement add_gid_entry/rem_gid_entry functions 3. Remove the RoCE type from address structure (e.g. ib_ah_attr). Leave it only in rdma_addr to keep the hint from the IP stack 4. Choose sgid_index (from all matching entries) in RDMA-CM based on hint from IP stack, user preference and admin policy If so, I guess that we have a plan. Thanks Moni -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] Adding support for RoCE V2 specification
Yes, that is correct. The overarching goal behind this patch is to keep RDMA-CM as the central entity that decides the protocol (ROCEV2 /ROCEV1) and hides all the address resolution details from applications. I agree but with one comment. RDMA-CM, although preferred, is not mandatory to establish a connection. The solution needs to take care also for apps that don't use RDMA-CM This is just a continuation of the existing RDMA-CM design philosophy for IP Based GIDs that lets applications operate over any form of RDMA Service in a completely transparent way without any changes to the applications themselves. Protocol Resolution(RoCEV2 or V1) must be done even before we send out the 1st connection request packet, the corresponding MAD pkt must be V2 for a destination that is behind a router, correct? I am not sure we want to try sending out RoCEv2 connection requests first and if that fails ,then fallback/retry with RoCEv1 request. I agree but when you send the first packet (request or response) you know the SGID index to be used. Knowing that and assuming you chose the SGID index correctly there is no need to guess. A suggestion for choosing the SGID index is: 1. First, populate GID table with GIDs matching the IP address list of the corresponding net devices. Each GID will appear twice, one for RoCEv1 and one for RoCEv2. 2. When connection is being established RDMA-CM will choose one of the 2 matching indices based on a preferred protocol attribute for the rmda_id (new attr.) 3. Proffered protocol is a) global for the node/fabric based on the administrator settings b) can be changed by applications that want to override the default (requires extra API) That would needlessly complicate and slow down the connection-establishment process, do you agree ? Instead it's much more seamless to take help of the IP stack to select the network header type and subsequently the ROCE pkt format while attempting to establish the connection. The one thing that my patch does miss out though is the ability for local subnet end nodes to be able to communicate using RoCEv2 packet formats. (The default policy in this patch is to use ROCEV1 for nodes connected directly /within the same local subnet.) I believe that can be achieved by extending RDMA-CM to override the default protocol selection policy proposed in this patch by making use of the GID Entry attribute type in another patch. Right, and this seems to fit the proposal above. This scheme should still work with your proposal in point# 2 , where GID Table management would move to a central/stack module instead of being vendor specific. Thanks Som -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/2] Adding support for RoCE V2 specification
Hi Moni/Shachar, Thank you for your feedback/comments. I have tried to address some of your concerns /comments inline below. Few comments to the IB/core part 1. RoCEv2 spec (Annex A17) sees RoCE type (or as the patch name it - network_hdr_type) as an attribute of the source GID. If so, there is no point to for address structures (ib_ah_attr, ib_sa_path_rec, …) to have both sgid and RoCE type. The alternative is to keep an extra info in each GID table entry about the RoCE header (IPv4/IPv6 or GRH) and use it when modifying QP from INIT to RTR (RC/UC) or creating AH (UD). Yes, that is correct. The overarching goal behind this patch is to keep RDMA-CM as the central entity that decides the protocol (ROCEV2 /ROCEV1) and hides all the address resolution details from applications. This is just a continuation of the existing RDMA-CM design philosophy for IP Based GIDs that lets applications operate over any form of RDMA Service in a completely transparent way without any changes to the applications themselves. Protocol Resolution(RoCEV2 or V1) must be done even before we send out the 1st connection request packet, the corresponding MAD pkt must be V2 for a destination that is behind a router, correct? I am not sure we want to try sending out RoCEv2 connection requests first and if that fails ,then fallback/retry with RoCEv1 request. That would needlessly complicate and slow down the connection-establishment process, do you agree ? Instead it's much more seamless to take help of the IP stack to select the network header type and subsequently the ROCE pkt format while attempting to establish the connection. The one thing that my patch does miss out though is the ability for local subnet end nodes to be able to communicate using RoCEv2 packet formats. (The default policy in this patch is to use ROCEV1 for nodes connected directly /within the same local subnet.) I believe that can be achieved by extending RDMA-CM to override the default protocol selection policy proposed in this patch by making use of the GID Entry attribute type in another patch. This scheme should still work with your proposal in point# 2 , where GID Table management would move to a central/stack module instead of being vendor specific. Thanks Som 2. Population of table in for RoCE should be a common code to all drivers and needs to be moved to IB/core (unlike today when it is specific to a vendor). We plan to push such a change soon. 3. If I get it right, the purpose of the change in ib_addr is to that determine RoCE type based on the resolution of address from OS. If gateway is used (dest address is not in the subnet of source address) then the protocol will be V2, otherwise it will be V1. This means that IB over UDP is not possible on the same network. I don't think that this is acceptable.
Re: [PATCH 0/2] Adding support for RoCE V2 specification
On Thu, Nov 27, 2014 at 2:56 AM, Somnath Kotur somnath.ko...@emulex.com wrote: Please apply. Somnath Kotur (2): IB/Core: Changes to the IB Core infrastructure for RoCE V2. IB/ocrdma: Changes to comply with RoCEv2 Specification changes drivers/infiniband/core/addr.c | 13 +- drivers/infiniband/core/cm.c| 15 --- drivers/infiniband/core/cma.c |4 +- drivers/infiniband/core/sa_query.c |1 + drivers/infiniband/core/verbs.c | 47 +--- drivers/infiniband/hw/ocrdma/ocrdma.h |2 + drivers/infiniband/hw/ocrdma/ocrdma_ah.c| 64 +- drivers/infiniband/hw/ocrdma/ocrdma_hw.c| 26 +++ drivers/infiniband/hw/ocrdma/ocrdma_sli.h | 16 ++- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 11 - include/rdma/ib_addr.h |3 +- include/rdma/ib_sa.h|1 + include/rdma/ib_verbs.h | 15 ++ 13 files changed, 187 insertions(+), 31 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Few comments to the IB/core part 1. RoCEv2 spec (Annex A17) sees RoCE type (or as the patch name it - network_hdr_type) as an attribute of the source GID. If so, there is no point to for address structures (ib_ah_attr, ib_sa_path_rec, …) to have both sgid and RoCE type. The alternative is to keep an extra info in each GID table entry about the RoCE header (IPv4/IPv6 or GRH) and use it when modifying QP from INIT to RTR (RC/UC) or creating AH (UD). 2. Population of table in for RoCE should be a common code to all drivers and needs to be moved to IB/core (unlike today when it is specific to a vendor). We plan to push such a change soon. 3. If I get it right, the purpose of the change in ib_addr is to that determine RoCE type based on the resolution of address from OS. If gateway is used (dest address is not in the subnet of source address) then the protocol will be V2, otherwise it will be V1. This means that IB over UDP is not possible on the same network. I don't think that this is acceptable. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/2] Adding support for RoCE V2 specification
Hi, -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- ow...@vger.kernel.org] On Behalf Of Somnath Kotur Sent: Thursday, November 27, 2014 2:56 AM To: rol...@kernel.org Cc: linux-rdma@vger.kernel.org; Somnath Kotur Subject: [PATCH 0/2] Adding support for RoCE V2 specification Please apply. This cover letter is rather short. Can you please elaborate? Specifically, I will be happy to hear what is the expected usage model with RoCE v2. What are the goals of the changes you are asking to push? What assumptions do you have on the network topology? How do you decide when to use RoCE v1 and RoCE v2? What is the integration of RoCE v2 with the CMA? What changes in the user space libraries and applications are needed to support RoCE v2 on base of this kernel patch? Somnath Kotur (2): IB/Core: Changes to the IB Core infrastructure for RoCE V2. IB/ocrdma: Changes to comply with RoCEv2 Specification changes drivers/infiniband/core/addr.c | 13 +- drivers/infiniband/core/cm.c| 15 --- drivers/infiniband/core/cma.c |4 +- drivers/infiniband/core/sa_query.c |1 + drivers/infiniband/core/verbs.c | 47 +--- drivers/infiniband/hw/ocrdma/ocrdma.h |2 + drivers/infiniband/hw/ocrdma/ocrdma_ah.c| 64 +- drivers/infiniband/hw/ocrdma/ocrdma_hw.c| 26 +++ drivers/infiniband/hw/ocrdma/ocrdma_sli.h | 16 ++- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 11 - include/rdma/ib_addr.h |3 +- include/rdma/ib_sa.h|1 + include/rdma/ib_verbs.h | 15 ++ 13 files changed, 187 insertions(+), 31 deletions(-) Thanks, --Shachar -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 0/2] Adding support for RoCE V2 specification
Thank you Shachar for your comments. Sure, will respond to your queries and elaborate more with an updated cover note and re-send the patch series . Hope that is fine with you? Thanks Som -Original Message- From: Shachar Raindel [mailto:rain...@mellanox.com] Sent: Thursday, November 27, 2014 9:22 PM To: Somnath Kotur; rol...@kernel.org Cc: linux-rdma@vger.kernel.org Subject: RE: [PATCH 0/2] Adding support for RoCE V2 specification Hi, -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- ow...@vger.kernel.org] On Behalf Of Somnath Kotur Sent: Thursday, November 27, 2014 2:56 AM To: rol...@kernel.org Cc: linux-rdma@vger.kernel.org; Somnath Kotur Subject: [PATCH 0/2] Adding support for RoCE V2 specification Please apply. This cover letter is rather short. Can you please elaborate? Specifically, I will be happy to hear what is the expected usage model with RoCE v2. What are the goals of the changes you are asking to push? What assumptions do you have on the network topology? How do you decide when to use RoCE v1 and RoCE v2? What is the integration of RoCE v2 with the CMA? What changes in the user space libraries and applications are needed to support RoCE v2 on base of this kernel patch? Somnath Kotur (2): IB/Core: Changes to the IB Core infrastructure for RoCE V2. IB/ocrdma: Changes to comply with RoCEv2 Specification changes drivers/infiniband/core/addr.c | 13 +- drivers/infiniband/core/cm.c| 15 --- drivers/infiniband/core/cma.c |4 +- drivers/infiniband/core/sa_query.c |1 + drivers/infiniband/core/verbs.c | 47 +--- drivers/infiniband/hw/ocrdma/ocrdma.h |2 + drivers/infiniband/hw/ocrdma/ocrdma_ah.c| 64 +- drivers/infiniband/hw/ocrdma/ocrdma_hw.c| 26 +++ drivers/infiniband/hw/ocrdma/ocrdma_sli.h | 16 ++- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 11 - include/rdma/ib_addr.h |3 +- include/rdma/ib_sa.h|1 + include/rdma/ib_verbs.h | 15 ++ 13 files changed, 187 insertions(+), 31 deletions(-) Thanks, --Shachar -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html