Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On 4/29/2013 11:36 PM, Jason Gunthorpe wrote: On Mon, Apr 29, 2013 at 10:52:21PM +0300, Or Gerlitz wrote: On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote: But I don't follow why the send QPNs have to be sequential for IPoIB. It looks like this is being motivated by RSS and RSS QPNs are just being reused for TSS? Go read "It turns out that there are IPoIB drivers used by some operating-systems and/or Hypervisors in a para-virtualization (PV) scheme which extract the source QPN from the CQ WC associated with incoming packets in order to.." and what follows in the change-log of patch 4/5 http://marc.info/?l=linux-rdma&m=136412901621797&w=2 This is what I said in the first place, the RFC is premised on the src.QPN being set properly, you can't just mess with it, because stuff needs it. I think you should have split this patch up, there is lots going on here. - Add proper TSS that doesn't change the wire protocol - Add fake TSS that does change the wire protocol, and properly document those changes so other people can follow/implement them - Add RSS And.. 'tss_qpn_mask_sz' seems unnecessarily limiting, using WC.srcQPN + ipoib_header.tss_qpn_offset == real QPN (ie use a signed offset, not a mask) Seems much better than Wc.srcQPN & ~((1 << (ipoib_header.tss_qpn_mask_sz >> 12)) - 1) == real QPN (Did I even get that right?) Specifically it means the requirements for alignment and contiguous-ness are gone. This means you can implement it without using the QP groups API and it will work immediately with every HCA out there. I think if we are going to actually mess with the wire protocol this sort of broad applicability is important. As for the other two questions: seems reasonable to me. Without a consensus among HW vendors on how to do this it makes sense to move ahead *in the kernel* with a minimal API. Userspace is a different question of course.. Jason Hi Jason, Your suggestion could have been valid if the IPoIB header was larger.
Please note that a QPN occupies 3 octets and thus its value lies in the range [0..0xFFFFFF]. On the other hand, the reserved field in the IPoIB header occupies only 2 octets, so given an arbitrary group of source QPNs it may not be possible to recover the real QPN. This is why the real QPN should be aligned to a power of two and the rest should have consecutive numbers. And since the number of TSS QPs is relatively small, that is, on the order of the number of cores, masking the lower bits of the Wc.srcQPN will recover the real QPN. Also, by sending only the mask length we don't use the entire reserved field but only 4 bits, leaving 12 bits for future use. Best regards, S.P. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Tue, Apr 30, 2013 at 12:04:25PM +0300, Shlomo Pongratz wrote: And.. 'tss_qpn_mask_sz' seems unnecessarily limiting, using WC.srcQPN + ipoib_header.tss_qpn_offset == real QPN (ie use a signed offset, not a mask) Seems much better than Wc.srcQPN & ~((1 << (ipoib_header.tss_qpn_mask_sz >> 12)) - 1) == real QPN (Did I even get that right?) Your suggestion could have been valid if the IPoIB header was larger. Please note that a QPN occupies 3 octets and thus its value lies in the range [0..0xFFFFFF]. I am aware of this, and it isn't really a problem; adaptors that allocate randomly across the entire QPN space would not be compatible with this approach, but most adaptors allocate QPNs quasi-contiguously. Basically, at startup, IPoIB would allocate a TX QP, then allocate TSS QPs, and throw away any that can't fit in the encoding, until it reaches the target number or has tried for too long. No need for a special API to the driver. Jason
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
Jason Gunthorpe jguntho...@obsidianresearch.com wrote: For the TSS case, I'd say just allocate normal QPs and provide something like ibv_override_ud_src_qpn(). This is very general and broadly useful for any application using UD QPs. I've lost you: how do you suggest implementing ibv_override_ud_src_qpn()? Is that for future HW, or do you have a method that works today?
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Tue, Apr 30, 2013 at 11:08:19PM +0300, Or Gerlitz wrote: Jason Gunthorpe jguntho...@obsidianresearch.com wrote: For the TSS case, I'd say just allocate normal QPs and provide something like ibv_override_ud_src_qpn(). This is very general and broadly useful for any application using UD QPs. I've lost you: how do you suggest implementing ibv_override_ud_src_qpn()? Is that for future HW, or do you have a method that works today? I meant it as a user space API alternative to the parent/child group API for transmit. It would require some level of driver/FW/HW support of course. Jason
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote: Also, I feel what happens inside the kernel is more flexible API-wise, so dropping the uverbs component may also be something you want to look at. We didn't submit any uverbs exporting of these verbs in this series. I am OK if the series is accepted for kernel use only (as was submitted) and later we open a discussion on the user space API; once that converges, we can decide whether to port the kernel RSS/TSS bits to the newly agreed API.
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote: But I don't follow why the send QPNs have to be sequential for IPoIB. It looks like this is being motivated by RSS and RSS QPNs are just being reused for TSS? Go read "It turns out that there are IPoIB drivers used by some operating-systems and/or Hypervisors in a para-virtualization (PV) scheme which extract the source QPN from the CQ WC associated with incoming packets in order to.." and what follows in the change-log of patch 4/5 http://marc.info/?l=linux-rdma&m=136412901621797&w=2
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote: As Sean said earlier, please think about a single QP, multiple RQ/SQ style API - that seems much more general to me and also could reasonably be defined for other transport types. I find it too abstract for a kernel-level API, since that single QP isn't really a HW construct but rather something artificial. For UD/RAW PACKET QPs, RSS is natural and done in maybe 100 Ethernet NIC drivers, where a special steering rule sends RX packets to a dispatcher QP which applies a hash and does 2nd-stage dispatching to QPs/rings depending on the hash result; now we want to bring that to IPoIB too, and we allow specifying that parent etc. Or. For instance, someday supporting multiple RQ on an RC transport, with content-based steering, is a limited form of tag matching.. From a longer-term user space API design standpoint the concept seems to have more longevity. Also, I feel what happens inside the kernel is more flexible API-wise, so dropping the uverbs component may also be something you want to look at. Jason
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Mon, Apr 29, 2013 at 10:52:21PM +0300, Or Gerlitz wrote: On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote: But I don't follow why the send QPNs have to be sequential for IPoIB. It looks like this is being motivated by RSS and RSS QPNs are just being reused for TSS? Go read "It turns out that there are IPoIB drivers used by some operating-systems and/or Hypervisors in a para-virtualization (PV) scheme which extract the source QPN from the CQ WC associated with incoming packets in order to.." and what follows in the change-log of patch 4/5 http://marc.info/?l=linux-rdma&m=136412901621797&w=2 This is what I said in the first place, the RFC is premised on the src.QPN being set properly, you can't just mess with it, because stuff needs it. I think you should have split this patch up, there is lots going on here. - Add proper TSS that doesn't change the wire protocol - Add fake TSS that does change the wire protocol, and properly document those changes so other people can follow/implement them - Add RSS And.. 'tss_qpn_mask_sz' seems unnecessarily limiting, using WC.srcQPN + ipoib_header.tss_qpn_offset == real QPN (ie use a signed offset, not a mask) Seems much better than Wc.srcQPN & ~((1 << (ipoib_header.tss_qpn_mask_sz >> 12)) - 1) == real QPN (Did I even get that right?) Specifically it means the requirements for alignment and contiguous-ness are gone. This means you can implement it without using the QP groups API and it will work immediately with every HCA out there. I think if we are going to actually mess with the wire protocol this sort of broad applicability is important. As for the other two questions: seems reasonable to me. Without a consensus among HW vendors on how to do this it makes sense to move ahead *in the kernel* with a minimal API. Userspace is a different question of course..
Jason
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Wed, Apr 24, 2013 at 02:24:45AM +, Hefty, Sean wrote: Conceptually, RSS/TSS are a set of send/receive work queues all belonging to the same transport level address. There's no parent-child relationship or needed pairing of send and receive queues together in order to form a group. This view makes sense to me as well. Can someone also confirm that using TSS doesn't affect the on-the-wire packets vs the non-TSS cases? I heard a few comments that sounded like TSS users get a per-queue QPN in the outgoing packet rather than a single QPN for the group, which would be pretty ugly. IMHO, this sort of stuff needs to have a very well defined on-the-wire behaviour, even if it is just documented in the ibverbs man pages. Personally, I'd like to see a way that better captures the notion of a 'set of work queues with the same address'. For example, it makes more sense to me if a user created/destroyed the work queues together, and if the WQs were viewed as being in a single state (INIT, RTR, RTS...). Yah, an API that made work queues a sub object of the QP seems to make much more sense than trying to manage an array of QPs. Jason
RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
Conceptually, RSS/TSS are a set of send/receive work queues all belonging to the same transport level address. There's no parent-child relationship or needed pairing of send and receive queues together in order to form a group. This view makes sense to me as well. Can someone also confirm that using TSS doesn't affect the on-the-wire packets vs the non-TSS cases? I heard a few comments that sounded like TSS users get a per-queue QPN in the outgoing packet rather than a single QPN for the group, which would be pretty ugly. After speaking with Tzahi, my understanding is that the receive work queues all receive on the same QPN, but the send work queues use different QPNs. The on-wire packets are affected, specifically the ipoib header. This is why the send QPNs must be sequential, so that a mask can be applied at the receiving side to determine a single source QPN. - Sean
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Thu, Apr 25, 2013 at 08:26:45PM +, Hefty, Sean wrote: Conceptually, RSS/TSS are a set of send/receive work queues all belonging to the same transport level address. There's no parent-child relationship or needed pairing of send and receive queues together in order to form a group. This view makes sense to me as well. Can someone also confirm that using TSS doesn't affect the on-the-wire packets vs the non-TSS cases? I heard a few comments that sounded like TSS users get a per-queue QPN in the outgoing packet rather than a single QPN for the group, which would be pretty ugly. After speaking with Tzahi, my understanding is that the receive work queues all receive on the same QPN, but the send work queues use different QPNs. The on-wire packets are affected, specifically the ipoib header. This is why the send QPNs must be sequential, so that a mask can be applied at the receiving side to determine a single source QPN. Ah, this seems contrary to the IPoIB specification? Someone should probably talk about how sending from the wrong QPN is acceptable.. As I said, that is ugly. 'TSS' that changes the on-the-wire packet is not TSS. It is just ganging QPs together. Allocating sequential TSS QPNs is an awful hack, what we really need is a way to force a UD QP's outgoing QPN. Jason
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Thu, Apr 25, 2013 at 08:26:45PM +, Hefty, Sean wrote: After speaking with Tzahi, my understanding is that the receive work queues all receive on the same QPN, but the send work queues use different QPNs. The on-wire packets are affected, specifically the ipoib header. This is why the send QPNs must be sequential, so that a mask can be applied at the receiving side to determine a single source QPN. Ah, this seems contrary to the IPoIB specification? Someone should probably talk about how sending from the wrong QPN is acceptable.. AFAIK the IPoIB specification doesn't mandate the QPN of the sender. As I said, that is ugly. 'TSS' that changes the on-the-wire packet is not TSS. It is just ganging QPs together. Allocating sequential TSS QPNs is an awful hack, what we really need is a way to force a UD QP's outgoing QPN. INDEED, but this must be supported by the HW. The patch set already supports the case of HW that knows to do that forcing, quoting --- IB_DEVICE_UD_TSS which is set to indicate that the device supports HW TSS which means that the HW is capable of over-riding the source UD QPN present in the sent IB datagram header (DETH) with the parent's QPN --- where over such HW the on-the-wire IPoIB header isn't touched. BUT for the sake of improving performance and being competitive with the tons of Linux Ethernet drivers that support TSS/MQ, we still need IPoIB to support MQ/TSS before such HW is introduced, and as such the chosen solution was to use reserved fields of the wire header. How about we discuss RSS first? For RSS no wire change is introduced; let's see if/how we can come to an agreement on how the RSS-related verbs should look and we'll take it from there to TSS. Or.
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Thu, Apr 25, 2013 at 11:56:16PM +0300, Or Gerlitz wrote: Ah, this seems contrary to the IPoIB specification? Someone should probably talk about how sending from the wrong QPN is acceptable.. AFAIK the IPoIB specification doesn't mandate the QPN of the sender I'd have to read it again very carefully.. However, checking the src QPN of every UD packet is the only way to detect if the packet was generated by the authentic kernel or from an unprivileged user space process, so there is certainly importance in the value. But I don't follow why the send QPNs have to be sequential for IPoIB. It looks like this is being motivated by RSS and RSS QPNs are just being reused for TSS? As I said, that is ugly. 'TSS' that changes the on-the-wire packet is not TSS. It is just ganging QPs together. Allocating sequential TSS QPNs is an awful hack, what we really need is a way to force a UD QP's outgoing QPN. INDEED, but this must be supported by the HW. The patch set already supports the case of HW that knows to do that forcing, quoting --- IB_DEVICE_UD_TSS which is set to indicate that the device supports HW TSS which means that the HW is capable of over-riding the source UD QPN present in the sent IB datagram header (DETH) with the parent's QPN --- where over such HW the on-the-wire IPoIB header isn't touched. For the TSS case, I'd say just allocate normal QPs and provide something like ibv_override_ud_src_qpn(). This is very general and broadly useful for any application using UD QPs. BUT for the sake of improving performance and being competitive with the tons of Linux Ethernet drivers that support TSS/MQ, we still need IPoIB to support MQ/TSS before such HW is introduced, and as such the chosen solution was to use reserved fields of the wire header. You've lost me again, what reserved bits? If a new uverb is introduced the on-the-wire behaviour needs to be fully documented.. How about we discuss RSS first?
For RSS no wire change is introduced; let's see if/how we can come to an agreement on how the RSS-related verbs should look and we'll take it from there to TSS. Well, to me, TSS is pretty simple. RSS is where things got really complicated.. As Sean said earlier, please think about a single QP, multiple RQ/SQ style API - that seems much more general to me and also could reasonably be defined for other transport types. For instance, someday supporting multiple RQ on an RC transport, with content-based steering, is a limited form of tag matching.. From a longer-term user space API design standpoint the concept seems to have more longevity. Also, I feel what happens inside the kernel is more flexible API-wise, so dropping the uverbs component may also be something you want to look at. Jason
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Mon, Apr 22, 2013 at 7:46 PM, Or Gerlitz or.gerl...@gmail.com wrote: Sean, Tzahi -- I understand now that there have been a few talks at the OFA meeting re this patch set. So what's the way to move forward, any concrete feedback that needs to be addressed here? This series has been hanging around since May 2012 and I'd like to see it get in for 3.10; now, if indeed Sean is OK with the general framework, please suggest. Sean, I understand that following some conversations held at the OFA meetings you kind of took back the concerns you raised regarding the concept of the verbs level QP group which is used by this series to implement RSS and TSS, can you acknowledge that? Roland, this series has been around for about a year now, any feedback or comments from your side that we need to address for it to get accepted? Or.
RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Mon, Apr 22, 2013 at 7:46 PM, Or Gerlitz or.gerl...@gmail.com wrote: Sean, Tzahi -- I understand now that there have been a few talks at the OFA meeting re this patch set. So what's the way to move forward, any concrete feedback that needs to be addressed here? This series has been hanging around since May 2012 and I'd like to see it get in for 3.10; now, if indeed Sean is OK with the general framework, please suggest. Sean, I understand that following some conversations held at the OFA meetings you kind of took back the concerns you raised regarding the concept of the verbs level QP group which is used by this series to implement RSS and TSS, can you acknowledge that? No - I agree with the RSS/TSS concept. That I've never had an issue with. My issue is that the current verbs API appears to be a poor fit. I don't have a good answer for an alternative. Conceptually, RSS/TSS are a set of send/receive work queues all belonging to the same transport level address. There's no parent-child relationship or needed pairing of send and receive queues together in order to form a group. Personally, I'd like to see a way that better captures the notion of a 'set of work queues with the same address'. For example, it makes more sense to me if a user created/destroyed the work queues together, and if the WQs were viewed as being in a single state (INIT, RTR, RTS...). I'm just thinking out loud here, hoping that it spurs ideas, but if we added a call like: struct ib_qp *ib_create_wq_array/set/group(...); then added the ability to specify which WQ a specific send or receive should be posted to, it may do a better job of capturing RSS/TSS concepts, but still make use of the existing calls. (Underneath this, the driver can allocate actual QPs with sequential QPNs or whatever is required, but that's not exposed.) Obviously, I haven't thought through the specifics. I'll try to meet up with Diego and Tzahi tonight or tomorrow to discuss this further.
- Sean
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Mon, Apr 15, 2013 at 4:21 PM, Or Gerlitz or.gerl...@gmail.com wrote: Actually these comments and questions on the series come just a week before the annual OFA gathering; personally, I will not be there, nor will Shlomo, who is the author of the patches, but Tzahi Oved from Mellanox, who led the architecture for the QP group concept, is planning to attend, and same for Sean, Roland and I hope you (Ira) too, same for Ali Ayoub and Liran Liss from Mellanox who are attending too; all in all, a nice quorum to get into a room and do white boarding, open discussion, laughing, yelling and whatever is needed to get a consensus. [...] Sean, Tzahi -- I understand now that there have been a few talks at the OFA meeting re this patch set. So what's the way to move forward, any concrete feedback that needs to be addressed here? This series has been hanging around since May 2012 and I'd like to see it get in for 3.10; now, if indeed Sean is OK with the general framework, please suggest. Or.
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
Weiny, Ira ira.we...@intel.com wrote: ow...@vger.kernel.org] On Behalf Of Or Gerlitz RSS child QPs are plain UD or RAW Packet QPs that only have consecutive QPNs, which is a common requirement of HW for configuring the RSS parent, which in networking is called the RSS indirection or dispatching QP. You can send and receive on them. How do you ensure that the QPNs are consecutive? Quoting from this patch change-log: start A QP group is a set of QPs consisting of a parent QP and two disjoint sets of RSS and TSS QPs. The creation of a QP group is a two stage process: In the 1st stage, the parent QP is created. In the 2nd stage the children QPs of the parent are created. Each child QP indicates if it's an RSS or TSS QP. Both the TSS and RSS sets of QPs should have contiguous QP numbers. end When the parent is created we (the driver) are told by the consumer (providing an instance of struct ib_qpg_init_attrib) how many child QPs they will need, so we can internally act up front and make sure there's a consecutive chain of QPNs reserved for that group. If an RSS child goes to the error state it will not receive data. If you transition it back to RTS would it start working again? YES Could you remove it and add a new one? (I guess not because the new QPN would likely not be consecutive.) NO, it's disallowed to destroy any of the child QPs as long as the parent is there, quoting from the change log: start It is forbidden to modify the parent QP state before all RSS/TSS children were created. In the same manner it is disallowed to destroy the parent QP unless all RSS/TSS children were destroyed. end Packets are routed to RSS children only per the hash function output, not per the state of that child. So if the QP chosen by the hash is in error state the packets get lost? Above you said they would not receive data. Indeed, they get lost, which means data will not be received; I am not sure what isn't aligned with what I said in that above comment.
Actually these comments and questions on the series come just a week before the annual OFA gathering; personally, I will not be there, nor will Shlomo, who is the author of the patches, but Tzahi Oved from Mellanox, who led the architecture for the QP group concept, is planning to attend, and same for Sean, Roland and I hope you (Ira) too, same for Ali Ayoub and Liran Liss from Mellanox who are attending too; all in all, a nice quorum to get into a room and do white boarding, open discussion, laughing, yelling and whatever is needed to get a consensus. It would be good if a BOF were set up to discuss the QP groups concept and how to proceed with getting the verbs layer to support RSS/TSS so we can finally 1. embed them within the verbs language 2. support MQ/RSS in the IPoIB network driver Or.
RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups
-Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- ow...@vger.kernel.org] On Behalf Of Or Gerlitz Sent: Thursday, April 11, 2013 4:09 PM To: Hefty, Sean Cc: rol...@kernel.org; linux-rdma@vger.kernel.org; shlo...@mellanox.com; Tzahi Oved Subject: Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups On Thu, Apr 11, 2013 at 5:45 PM, Hefty, Sean sean.he...@intel.com wrote: [...] but let's get there after hopefully agreeing on what an RSS QP group is. So far, this is what I think it is: - a collection of related receive queues - each queue is configured somewhat separately - i.e. sized, CQ, sge size, etc. - receives are posted to the queues separately - the queues are allocated together This is where it gets confusing. They're allocated together, but through separate API calls. I'm not sure if they share states or not. Can all of them but one go into the error state and still receive data? Do packets get routed to whichever queue actually works, or do packets sometimes get dropped, but sometimes don't, depending on some local rules which have been programmed into the HCA? Can the queues really be destroyed independently? We only require (== implemented that) the verbs level to mandate that the RSS parent not be destroyed before ANY of the RSS children is destroyed and be placed into action only after ALL RSS children are created. The queues (RSS children) can be destroyed independently after the parent is destroyed, yes. RSS child QPs are plain UD or RAW Packet QPs that only have consecutive QPNs, which is a common requirement of HW for configuring the RSS parent, which in networking is called the RSS indirection or dispatching QP. You can send and receive on them. How do you ensure that the QPNs are consecutive? If an RSS child goes to the error state it will not receive data. If you transition it back to RTS would it start working again? Could you remove it and add a new one? (I guess not because the new QPN would likely not be consecutive.)
Packets are routed to RSS children only per the hash function output, not per the state of that child. So if the QP chosen by the hash is in error state the packets get lost? Above you said they would not receive data. Ira This design doesn't allow an app to do a DoS attack on the HW, either if they try that out or just have a bug, but it does require them to think out their code (design/review/test) - fair enough. RSS exists in almost any Ethernet NIC you take from the shelf, and works in the manner I explained here, e.g. if one of the Ethernet RSS child queues isn't functional, packets hashed to it will not be received. Is it even necessary to treat the receive queues as being independent? What happens if they're allocated, destroyed, and modified as a single entity? [...] can you elaborate/explain the question a little more? Or.
RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups
I have no issue with RSS/TSS. But the 'qp group' interface to using this seems kludgy. Let's try to be more specific. On a node, this is multiple send/receive queues grouped together to form a larger construct. On the wire, this is a single QP - maybe? I'm still not clear on that. From what's written, all the send queues appear as a single QPN. The receive queues appear as different QPNs. Starting with RSS QP groups: it's a group made of one parent QP and N RSS child QPs. On the wire everything is sent to the RSS parent QP; however, when the HW receives a packet for which this QP/QPN is the destination, it applies a hash function on the packet header and subject to the hash result dispatches the packet to one of the N child QPs. I'm not trying to be facetious here, but a QP is a 'pair' of queues - one send, one receive. The construct that you're talking about is a collection of send and/or receive queues, all addressed using the same transport level value. I don't see where or why there's a parent-child relationship between queues, either from the perspective of the user or the remote endpoint. The transport level address happens to be called a QPN, but that's just a misnomer. The design applies to IB UD QPs and Raw Ethernet Packet QP types; under IB the QPN of the parent is on the wire, under Eth there are no QPNs on the wire, but the HW has some steering rule which causes certain packets to be steered to that RSS parent, and the RSS parent in turn further makes a dispatching decision (hashing) to determine which of the child RSS QPs will actually receive that packet. The current implementation only applies to UD QPs, but I don't see why the concept can't apply to UC or RC. In my mind, a QP is really a special case of the RSS/TSS construct. With IPoIB, the remote side is provided with the RSS parent QPN as part of the IPoIB HW address provided in the ARP reply payload, so packets are sent to that QPN.
With RAW Packet Eth QPs, the remote side isn't aware of QPNs at all; everything goes through a steering rule that directs traffic to the RSS parent. You can send packets over the RSS parent QP but not receive packets on it. So for RSS, the remote side isn't aware of that QP group at all. Makes sense? As for TSS QP groups, generally speaking, the only case that really matters is applications/drivers that care about the source QPN of a packet. But let's get there after hopefully agreeing on what an RSS QP group is. So far, this is what I think it is:
- a collection of related receive queues
- each queue is configured somewhat separately - i.e. sized, CQ, sge size, etc.
- receives are posted to the queues separately
- the queues are allocated together
This is where it gets confusing. They're allocated together, but through separate API calls. I'm not sure if they share states or not. Can all of them but one go into the error state and still receive data? Do packets get routed to whichever queue actually works, or do packets sometimes get dropped, but sometimes don't, depending on some local rules which have been programmed into the HCA? Can the queues really be destroyed independently? Is it even necessary to treat the receive queues as being independent? What happens if they're allocated, destroyed, and modified as a single entity? We don't treat the send and receive queues that form a single queue 'pair' as having completely separate states. - Sean
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups
On Thu, Apr 11, 2013 at 5:45 PM, Hefty, Sean sean.he...@intel.com wrote: [...] but let's get there after hopefully agreeing on what an RSS QP group is. So far, this is what I think it is:
- a collection of related receive queues
- each queue is configured somewhat separately - i.e. sized, CQ, sge size, etc.
- receives are posted to the queues separately
- the queues are allocated together
This is where it gets confusing. They're allocated together, but through separate API calls. I'm not sure if they share states or not. Can all of them but one go into the error state and still receive data? Do packets get routed to whichever queue actually works, or do packets sometimes get dropped, but sometimes don't, depending on some local rules which have been programmed into the HCA? Can the queues really be destroyed independently? At the verbs level we only require (== implemented) that the RSS parent not be destroyed before all of the RSS children are destroyed, and that it be placed into action only after ALL the RSS children are created. The queues (RSS children) can be destroyed independently, before the parent is destroyed, yes. RSS child QPs are plain UD or RAW Packet QPs which merely have consecutive QPNs, a common HW requirement for configuring the RSS parent, which in networking terms is the RSS indirection or dispatching QP. You can send and receive on them. If an RSS child goes into the error state it will not receive data. Packets are routed to RSS children only per the hash function output, not per the state of that child. This design doesn't allow an app to mount a DoS attack on the HW, whether they try that deliberately or just have a bug, but it does require them to think their code through (design/review/test) - fair enough.
RSS exists in almost any Ethernet NIC you take off the shelf, and works in the manner I explained here: e.g. if one of the Ethernet RSS child queues isn't functional, packets hashed to it will not be received. Is it even necessary to treat the receive queues as being independent? What happens if they're allocated, destroyed, and modified as a single entity? [...] can you elaborate/explain the question a little more? Or.
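The "consecutive QPNs" requirement on RSS children mentioned in this exchange amounts to a simple invariant: the children's QPNs must form one contiguous block, so the HW can program its indirection table from a base QPN plus an index. A minimal, illustrative check (the function name is hypothetical, not part of the patch):

```c
#include <stdint.h>

/* Returns 1 if the n child QPNs form a single contiguous block
 * starting at child_qpn[0], which is what HW typically needs in order
 * to derive each child from a base QPN plus the hash index. */
static int qpg_children_contiguous(const uint32_t *child_qpn, int n)
{
	int i;

	for (i = 1; i < n; i++)
		if (child_qpn[i] != child_qpn[0] + (uint32_t)i)
			return 0;
	return 1;
}
```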
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups
Or Gerlitz or.gerl...@gmail.com wrote: Hefty, Sean sean.he...@intel.com wrote: I have no issue with RSS/TSS. But the 'qp group' interface to using this seems kludgy. Let's try to be more specific. On a node, this is multiple send/receive queues grouped together to form a larger construct. On the wire, this is a single QP - maybe? I'm still not clear on that. From what's written, all the send queues appear as a single QPN. The receive queues appear as different QPNs. Starting with RSS QP groups: it's a group made of one parent QP and N RSS child QPs. On the wire everything is sent to the RSS parent QP; however, when the HW receives a packet for which this QP/QPN is the destination, it applies a hash function on the packet header and, based on the hash result, dispatches the packet to one of the N child QPs. The design applies to IB UD QPs and Raw Ethernet Packet QP types. Under IB the QPN of the parent is on the wire; under Eth there are no QPNs on the wire, but the HW has a steering rule which causes certain packets to be steered to that RSS parent, and the RSS parent in turn makes a further dispatch decision (hashing) to determine which of the child RSS QPs will actually receive the packet. With IPoIB, the remote side is provided with the RSS parent QPN as part of the IPoIB HW address carried in the ARP reply payload, so packets are sent to that QPN. With RAW Packet Eth QPs, the remote side isn't aware of QPNs at all; everything goes through a steering rule that directs traffic to the RSS parent. You can send packets over the RSS parent QP but not receive packets on it. So for RSS, the remote side isn't aware of that QP group at all. Makes sense? does it? As for TSS QP groups, generally speaking, the only case that really matters is applications/drivers that care about the source QPN of a packet. But let's get there after hopefully agreeing on what an RSS QP group is. Or.
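Regarding the "RSS parent QPN as part of the IPoIB HW address in the ARP reply" point above: the IPoIB 20-byte link-layer address (per RFC 4391) carries the QPN in the low 24 bits of its first 4 octets, followed by the 16-byte port GID, so only the parent QPN is ever published to peers and the children stay invisible on the wire. A sketch of packing/unpacking that address (helper names are made up for the example; this is not the kernel's ipoib code):

```c
#include <stdint.h>
#include <string.h>

#define IPOIB_HW_ADDR_LEN 20	/* 4-octet QPN field + 16-octet GID */

/* Build the 20-byte hardware address a peer learns from the ARP reply:
 * byte 0 is flags/reserved, bytes 1-3 hold the 24-bit QPN, bytes 4-19
 * hold the port GID. With RSS, parent_qpn is the RSS parent's QPN. */
static void ipoib_pack_hw_addr(uint8_t addr[IPOIB_HW_ADDR_LEN],
			       uint32_t parent_qpn, const uint8_t gid[16])
{
	addr[0] = 0;
	addr[1] = (parent_qpn >> 16) & 0xff;
	addr[2] = (parent_qpn >> 8) & 0xff;
	addr[3] = parent_qpn & 0xff;
	memcpy(&addr[4], gid, 16);
}

/* What the remote side extracts and uses as the destination QPN. */
static uint32_t ipoib_hw_addr_qpn(const uint8_t addr[IPOIB_HW_ADDR_LEN])
{
	return ((uint32_t)addr[1] << 16) | ((uint32_t)addr[2] << 8) | addr[3];
}
```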
Signed-off-by: Shlomo Pongratz shlo...@mellanox.com
---
 drivers/infiniband/core/uverbs_cmd.c         |   1 +
 drivers/infiniband/core/verbs.c              | 118 ++
 drivers/infiniband/hw/amso1100/c2_provider.c |   3 +
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |   2 +
 drivers/infiniband/hw/cxgb4/qp.c             |   3 +
 drivers/infiniband/hw/ehca/ehca_qp.c         |   3 +
 drivers/infiniband/hw/ipath/ipath_qp.c       |   3 +
 drivers/infiniband/hw/mlx4/qp.c              |   3 +
 drivers/infiniband/hw/mthca/mthca_provider.c |   3 +
 drivers/infiniband/hw/nes/nes_verbs.c        |   3 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |   5 +
 drivers/infiniband/hw/qib/qib_qp.c           |   5 +
 include/rdma/ib_verbs.h                      |  40 -
 13 files changed, 190 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index a7d00f6..b8f2dff 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1581,6 +1581,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
 	attr.sq_sig_type = cmd.sq_sig_all ? IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR;
 	attr.qp_type = cmd.qp_type;
 	attr.create_flags = 0;
+	attr.qpg_type = IB_QPG_NONE;
 	attr.cap.max_send_wr = cmd.max_send_wr;
 	attr.cap.max_recv_wr = cmd.max_recv_wr;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a8fdd33..f40f194 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -406,12 +406,98 @@ struct ib_qp *ib_open_qp(struct ib_xrcd *xrcd,
 }
 EXPORT_SYMBOL(ib_open_qp);

+static int ib_qpg_verify(struct ib_qp_init_attr *qp_init_attr)
+{
+	/* RSS/TSS QP group basic validation */
+	struct ib_qp *parent;
+	struct ib_qpg_init_attrib *attr;
+	struct ib_qpg_attr *pattr;
+
+	switch (qp_init_attr->qpg_type) {
+	case IB_QPG_PARENT:
+		attr = &qp_init_attr->parent_attrib;
+		if (attr->tss_child_count == 1)
+			return -EINVAL; /* doesn't make sense */
+		if (attr->rss_child_count == 1)
+			return -EINVAL; /* doesn't make sense */
+		if ((attr->tss_child_count == 0) &&
+		    (attr->rss_child_count == 0))
+			/* should be called with IB_QPG_NONE */
+			return -EINVAL;
+		break;
+	case IB_QPG_CHILD_RX:
+		parent = qp_init_attr->qpg_parent;
+		if (!parent || parent->qpg_type != IB_QPG_PARENT)
+			return -EINVAL;
+		pattr = &parent->qpg_attr.parent_attr;
+		if (!pattr->rss_child_count)
+			return -EINVAL;
+		if (atomic_read(&pattr->rsscnt) >= pattr->rss_child_count)
+			return -EINVAL;
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups
This patch introduces the concept of RSS and TSS QP groups, which allows low level drivers to implement them and IPoIB (and later also user space ULPs) to use them. A QP group is a set of QPs consisting of a parent QP and two disjoint sets of RSS and TSS QPs. The creation of a QP group is a two stage process: in the 1st stage the parent QP is created; in the 2nd stage the child QPs of the parent are created. Each child QP indicates whether it is an RSS or a TSS QP. Both the TSS and RSS sets of QPs should have contiguous QP numbers. It is forbidden to modify the parent QP's state before all RSS/TSS children are created. In the same manner it is disallowed to destroy the parent QP unless all RSS/TSS children have been destroyed. A few new elements/concepts are introduced to support this: Three new device capabilities that can be set by the low level driver:
- IB_DEVICE_QPG, which is set to indicate that QP groups are supported.
- IB_DEVICE_UD_RSS, which is set to indicate that the device supports RSS, that is, applying a hash function on incoming TCP/UDP/IP packets and dispatching them to multiple rings (child QPs).
- IB_DEVICE_UD_TSS, which is set to indicate that the device supports HW TSS, meaning the HW is capable of overriding the source UD QPN present in the sent IB datagram extended transport header (DETH) with the parent's QPN. Low level drivers not supporting HW TSS can still support QP groups; such a combination is referred to as SW TSS. In this case the low level driver fills in the qpg_tss_mask_sz field of struct ib_qp_cap returned from ib_create_qp, so that this mask can be used to retrieve the parent QPN from incoming packets carrying a child QPN (relying on the contiguous QP numbers requirement).
- a max rss table size device attribute, which is the maximal size of the RSS indirection table supported by the device
- a qp group type attribute for qp creation saying whether this is a parent QP, an rx/tx (rss/tss) child QP, or none of the above for non rss/tss QPs.
- per qp group type, another attribute is added: for parent QPs, the number of rx/tx child QPs; for child QPs, a pointer to the parent.
- the IB_QP_GROUP_RSS attribute mask, which should be used when modifying the parent QP state from reset to init
On Tue, Apr 9, 2013 at 8:06 PM, Hefty, Sean sean.he...@intel.com wrote: I have no issue with RSS/TSS. But the 'qp group' interface to using this seems kludgy. Let's try to be more specific. On a node, this is multiple send/receive queues grouped together to form a larger construct. On the wire, this is a single QP - maybe? I'm still not clear on that. From what's written, all the send queues appear as a single QPN. The receive queues appear as different QPNs. Starting with RSS QP groups: it's a group made of one parent QP and N RSS child QPs. On the wire everything is sent to the RSS parent QP; however, when the HW receives a packet for which this QP/QPN is the destination, it applies a hash function on the packet header and, based on the hash result, dispatches the packet to one of the N child QPs. The design applies to IB UD QPs and Raw Ethernet Packet QP types. Under IB the QPN of the parent is on the wire; under Eth there are no QPNs on the wire, but the HW has a steering rule which causes certain packets to be steered to that RSS parent, and the RSS parent in turn makes a further dispatch decision (hashing) to determine which of the child RSS QPs will actually receive the packet. With IPoIB, the remote side is provided with the RSS parent QPN as part of the IPoIB HW address carried in the ARP reply payload, so packets are sent to that QPN. With RAW Packet Eth QPs, the remote side isn't aware of QPNs at all; everything goes through a steering rule that directs traffic to the RSS parent. You can send packets over the RSS parent QP but not receive packets on it. So for RSS, the remote side isn't aware of that QP group at all. Makes sense?
As for TSS QP groups, generally speaking, the only case that really matters is applications/drivers that care about the source QPN of a packet. But let's get there after hopefully agreeing on what an RSS QP group is. Or. Signed-off-by: Shlomo Pongratz shlo...@mellanox.com
---
 drivers/infiniband/core/uverbs_cmd.c         |   1 +
 drivers/infiniband/core/verbs.c              | 118 ++
 drivers/infiniband/hw/amso1100/c2_provider.c |   3 +
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |   2 +
 drivers/infiniband/hw/cxgb4/qp.c             |   3 +
 drivers/infiniband/hw/ehca/ehca_qp.c         |   3 +
 drivers/infiniband/hw/ipath/ipath_qp.c       |   3 +
 drivers/infiniband/hw/mlx4/qp.c              |   3 +
 drivers/infiniband/hw/mthca/mthca_provider.c |   3 +
 drivers/infiniband/hw/nes/nes_verbs.c        |   3 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |   5 +
 drivers/infiniband/hw/qib/qib_qp.c           |   5 +
 include/rdma/ib_verbs.h                      |  40