Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-30 Thread Shlomo Pongratz

On 4/29/2013 11:36 PM, Jason Gunthorpe wrote:

On Mon, Apr 29, 2013 at 10:52:21PM +0300, Or Gerlitz wrote:

On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote:

But I don't follow why the send QPNs have to be sequential for
IPoIB. It looks like this is being motivated by RSS and RSS QPNs are
just being reused for TSS?

Go read It turns out that there are IPoIB drivers used by some
operating-systems
and/or Hypervisors in a para-virtualization (PV) scheme which extract the
source QPN from the CQ WC associated with an incoming packets in order
to.. and what follows in the change-log of patch 4/5
http://marc.info/?l=linux-rdmam=136412901621797w=2

This is what I said in the first place, the RFC is premised on the
src.QPN to be set properly, you can't just mess with it, because stuff
needs it.

I think you should have split this patch up, there is lots going on
here.

- Add proper TSS that doesn't change the wire protocol
- Add fake TSS that does change the wire protocol, and
   properly document those changes so other people can
   follow/implement them
- Add RSS

And.. 'tss_qpn_mask_sz' seems unnecessarily limiting, using
  WC.srcQPN + ipoib_header.tss_qpn_offset == real QPN
  (ie use a signed offset, not a mask)
Seems much better than
  Wc.srcQPN  ~((1(ipoib_header.tss_qpn_mask_sz  12))-1) == real QPN
  (Did I even get that right?)

Specifically it means the requirements for alignment and
contiguous-ness are gone. This means you can implement it without
using the QP groups API and it will work immediately with every HCA
out there. I think if we are going to actually mess with the wire
protocol this sort of broad applicability is important.

As for the other two questions: seems reasonable to me. Without a
consensus among HW vendors how to do this it makes sense to move ahead
*in the kernel* with a minimal API. Userspace is a different question
of course..

Jason

Hi Jason,

Your suggestion could have been valid if the the IPoIB header was larger.
Please note that the a QPN occupies 3 octets and thus its value lies in 
the range of [0..0xFF].
On the other hand the reserved field in the IPoIB header occupies only 2 
octets, so given an arbitrary group of source QPN it may be not possible 
to recover the real QPN.
This is why the real QPN should be a power of two and the rest should 
have consecutive numbers. And since the number of the TSS QP is 
relatively small, that is, in the order of the number of the cores than 
masking the lower bits of the Wc.srcQPN will recover the real QPN 
number.
Also by sending only the mask length we don't use the entire reserved 
filed but only 4 bits leaving 12 bits to future use.


Best regards,

S.P.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-30 Thread Jason Gunthorpe
On Tue, Apr 30, 2013 at 12:04:25PM +0300, Shlomo Pongratz wrote:

 And.. 'tss_qpn_mask_sz' seems unnecessarily limiting, using
   WC.srcQPN + ipoib_header.tss_qpn_offset == real QPN
   (ie use a signed offset, not a mask)
 Seems much better than
   Wc.srcQPN  ~((1(ipoib_header.tss_qpn_mask_sz  12))-1) == real QPN
   (Did I even get that right?)

 Your suggestion could have been valid if the the IPoIB header was larger.
 Please note that the a QPN occupies 3 octets and thus its value lies
 in the range of [0..0xFF].

I am aware of this, and it isn't really a problem, adaptors that
allocate randomly across the entire QPN space would not be compatible
with this approach, but most adaptors allocate QPNs
quasi-contiguously.

Basically, at startup, IPoIB would allocate a TX QP, then allocate TSS
QPs, and throw away any that can't fit in the encoding, until it
reaches the target number or tries too long. No need for a special API
to the driver.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-30 Thread Or Gerlitz
Jason Gunthorpe jguntho...@obsidianresearch.com wrote:

 For the TSS case, I'd say just allocate normal QPs and provide
 something like ibv_override_ud_src_qpn(). This is very general and
 broadly useful for any application using UD QPs.


I've lost you, how you suggest to implement ibv_override_ud_src_qpn(), is
that for future HW or you have a method to get work today.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-30 Thread Jason Gunthorpe
On Tue, Apr 30, 2013 at 11:08:19PM +0300, Or Gerlitz wrote:
 Jason Gunthorpe jguntho...@obsidianresearch.com wrote:
 
 For the TSS case, I'd say just allocate normal QPs and provide
 something like ibv_override_ud_src_qpn(). This is very general and
 broadly useful for any application using UD QPs.
 
 I've lost you, how you suggest to implement ibv_override_ud_src_qpn(), is that
 for future HW or you have a method to get work today.

I meant as a user space API alternative to the parent/child group API
for transmit. It would require some level of driver/FW/HW support of
course.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-29 Thread Or Gerlitz
On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe

 Also, I feel what happens inside the kernel is more flexable API
 wise, so dropping the uverbs component may also be something you want to look 
 at.

We didn't submit any uverbs exporting of these verbs on this series. I
am  OK if the series is accepted for kernel use only (as was
submitted) and later we open a discussion on the user space API where
once converges, we can decide if to port the kernel RSS/TSS bits to
the newly agreed API.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-29 Thread Or Gerlitz
On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote:
 But I don't follow why the send QPNs have to be sequential for
 IPoIB. It looks like this is being motivated by RSS and RSS QPNs are
 just being reused for TSS?

Go read It turns out that there are IPoIB drivers used by some
operating-systems
and/or Hypervisors in a para-virtualization (PV) scheme which extract the
source QPN from the CQ WC associated with an incoming packets in order
to.. and what follows in the change-log of patch 4/5
http://marc.info/?l=linux-rdmam=136412901621797w=2
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-29 Thread Or Gerlitz
On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote:

 As Sean said earlier, please think about a single QP, multiple RQ/SQ
 style API - that seems much more general to me and also could
 reasonably be defined for other transport types.

I find it to have too much of an abstraction for kernel level API,
since that single QP isn't really a HW construct but rather something
artificial. For UD/RAW PACKET QPs, RSS is natual and done on maybe 
100 Ethernet NIC drivers, where a special steering rule sends RX
packet to a dispatcher QP who applies hash and does 2nd dispatching to
QPs/rings depending on the hash results, so now we want to bring that
to IPoIB too, and we allow to specify that parent etc etc.

Or.



 For instance, someday supporting multiple RQ on a RC transport, with
 content-based steering, is a limited form of tag matching.. From a
 longer-term user space API design standpoint the concept seems to have
 more longevity.

 Also, I feel what happens inside the kernel is more flexable API
 wise, so dropping the uverbs component may also be something you want
 to look at.

 Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-29 Thread Jason Gunthorpe
On Mon, Apr 29, 2013 at 10:52:21PM +0300, Or Gerlitz wrote:
 On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote:
  But I don't follow why the send QPNs have to be sequential for
  IPoIB. It looks like this is being motivated by RSS and RSS QPNs are
  just being reused for TSS?
 
 Go read It turns out that there are IPoIB drivers used by some
 operating-systems
 and/or Hypervisors in a para-virtualization (PV) scheme which extract the
 source QPN from the CQ WC associated with an incoming packets in order
 to.. and what follows in the change-log of patch 4/5
 http://marc.info/?l=linux-rdmam=136412901621797w=2

This is what I said in the first place, the RFC is premised on the
src.QPN to be set properly, you can't just mess with it, because stuff
needs it.

I think you should have split this patch up, there is lots going on
here.

- Add proper TSS that doesn't change the wire protocol
- Add fake TSS that does change the wire protocol, and
  properly document those changes so other people can
  follow/implement them
- Add RSS

And.. 'tss_qpn_mask_sz' seems unnecessarily limiting, using
 WC.srcQPN + ipoib_header.tss_qpn_offset == real QPN
 (ie use a signed offset, not a mask)
Seems much better than
 Wc.srcQPN  ~((1(ipoib_header.tss_qpn_mask_sz  12))-1) == real QPN
 (Did I even get that right?)

Specifically it means the requirements for alignment and
contiguous-ness are gone. This means you can implement it without
using the QP groups API and it will work immediately with every HCA
out there. I think if we are going to actually mess with the wire
protocol this sort of broad applicability is important.

As for the other two questions: seems reasonable to me. Without a
consensus among HW vendors how to do this it makes sense to move ahead
*in the kernel* with a minimal API. Userspace is a different question
of course..

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-25 Thread Jason Gunthorpe
On Wed, Apr 24, 2013 at 02:24:45AM +, Hefty, Sean wrote:

 Conceptually, RSS/TSS are a set of send/receive work queues all
 belonging to the same transport level address.  There's no
 parent-child relationship or needed pairing of send and receive
 queues together in order to form a group.

This view makes sense to me as well. Can someone also confirm that
using TSS doesn't affect the on-the-wire packets vs the non-TSS cases?
I heard a few comments that sounded like TSS users get a per-queue QPN
in the outgoing packet rather than a single QPN for the group, which
would be pretty ugly.

IMHO, this sort of stuff needs to have a very well defined on-the-wire
behaviour, even if it is just documented in the ibverbs man pages.

 Personally, I'd like to see a way that better captures the notion of
 a 'set of work queues with the same address'.  For example, it makes
 more sense to me if a user created/destroyed the work queues
 together, and if the WQs were viewed as being in a single state
 (INIT, RTR, RTS...).

Yah, an API that made work queues a sub object of the QP seems to make
much more sense than trying to manage an array of QPs.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-25 Thread Hefty, Sean
  Conceptually, RSS/TSS are a set of send/receive work queues all
  belonging to the same transport level address.  There's no
  parent-child relationship or needed pairing of send and receive
  queues together in order to form a group.
 
 This view makes sense to me as well. Can someone also confirm that
 using TSS doesn't affect the on-the-wire packets vs the non-TSS cases?
 I heard a few comments that sounded like TSS users get a per-queue QPN
 in the outgoing packet rather than a single QPN for the group, which
 would be pretty ugly.

After speaking with Tzahi, my understanding is that the receive work queues all 
receive on the same QPN, but the send work queues use different QPNs.  The 
on-wire packets are affected, specifically the ipoib header.  This is why the 
send QPNs must be sequential, so that a mask can be applied at the receiving 
side to determine a single source QPN.
 
- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-25 Thread Jason Gunthorpe
On Thu, Apr 25, 2013 at 08:26:45PM +, Hefty, Sean wrote:
   Conceptually, RSS/TSS are a set of send/receive work queues all
   belonging to the same transport level address.  There's no
   parent-child relationship or needed pairing of send and receive
   queues together in order to form a group.
  
  This view makes sense to me as well. Can someone also confirm that
  using TSS doesn't affect the on-the-wire packets vs the non-TSS cases?
  I heard a few comments that sounded like TSS users get a per-queue QPN
  in the outgoing packet rather than a single QPN for the group, which
  would be pretty ugly.
 
 After speaking with Tzahi, my understanding is that the receive work
 queues all receive on the same QPN, but the send work queues use
 different QPNs.  The on-wire packets are affected, specifically the
 ipoib header.  This is why the send QPNs must be sequential, so that
 a mask can be applied at the receiving side to determine a single
 source QPN.

Ah, this seems contrary to the IPoIB specification? Someone should
probably talk about how sending from the wrong QPN is acceptable..

As I said, that is ugly. 'TSS' that changes the on-the-wire packet is
not TSS. It is just ganging QPs together.

Allocating sequential TSS QPNs is an awful hack, what we really need
is a way to force a UD QP's outgoing QPN.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-25 Thread Or Gerlitz
Jason Gunthorpe jguntho...@obsidianresearch.com wrote:

 On Thu, Apr 25, 2013 at 08:26:45PM +, Hefty, Sean wrote:
  After speaking with Tzahi, my understanding is that the receive work
  queues all receive on the same QPN, but the send work queues use
  different QPNs.  The on-wire packets are affected, specifically the
  ipoib header.  This is why the send QPNs must be sequential, so that
  a mask can be applied at the receiving side to determine a single
  source QPN.

 Ah, this seems contrary to the IPoIB specification? Someone should probably 
 talk about how sending from the wrong QPN is acceptable..



AFAIK the IPoIB specification doesn't mandate the QPN of the sender


 As I said, that is ugly. 'TSS' that changes the on-the-wire packet is not 
 TSS. It is just ganging QPs together.

 Allocating sequential TSS QPNs is an awful hack, what we really need is a way 
 to force a UD QP's outgoing QPN.


INDEED, but this must be supported by the HW. The patch set is already
supporting the case of HW the knows to do that forcing, quoting  ---
IB_DEVICE_UD_TSS which is set to indicate that the device supports HW
TSS which means that the HW is capable of over-riding the source UD
QPN present in sent IB datagram header (DTH) with the parent's QPN
--- where over such HW the on-the-wire IPoIB header isn't touched.

BUT for the sake of improving performance and being competitive with
tons of Linux Ethernet drivers that support TSS/MQ we still need IPoIB
to support MQ/TSS before such HW is introduced, and as such the chosen
solution was to use reserved fields of the wire header.

How about we discuss RSS 1st? for RSS no wire change is introduced,
lets see if/how we can come to an agreement how the RSS related verbs
should look like and we'll take it from there to TSS.

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-25 Thread Jason Gunthorpe
On Thu, Apr 25, 2013 at 11:56:16PM +0300, Or Gerlitz wrote:

  Ah, this seems contrary to the IPoIB specification? Someone should
  probably talk about how sending from the wrong QPN is acceptable..

 AFAIK the IPoIB specification doesn't mandate the QPN of the sender

I'd have to read it again very carefully.. However, checking the src
QPN of every UD packet is the only way to detect if the packet was
generated by the authentic kernel or from an unprivileged user space
process, so there is a certainly importance in the value.

But I don't follow why the send QPNs have to be sequential for
IPoIB. It looks like this is being motivated by RSS and RSS QPNs are
just being reused for TSS?

  As I said, that is ugly. 'TSS' that changes the on-the-wire packet
  is not TSS. It is just ganging QPs together.
 
  Allocating sequential TSS QPNs is an awful hack, what we really
  need is a way to force a UD QP's outgoing QPN.
 
 INDEED, but this must be supported by the HW. The patch set is already
 supporting the case of HW the knows to do that forcing, quoting  ---
 IB_DEVICE_UD_TSS which is set to indicate that the device supports HW
 TSS which means that the HW is capable of over-riding the source UD
 QPN present in sent IB datagram header (DTH) with the parent's QPN
 --- where over such HW the on-the-wire IPoIB header isn't touched.

For the TSS case, I'd say just allocate normal QPs and provide
something like ibv_override_ud_src_qpn(). This is very general and
broadly useful for any application using UD QPs.

 BUT for the sake of improving performance and being competitive with
 tons of Linux Ethernet drivers that support TSS/MQ we still need IPoIB
 to support MQ/TSS before such HW is introduced, and as such the chosen
 solution was to use reserved fields of the wire header.

You've lost me again, what reserved bits?

If a new uverb is introduced the on-the-wire behaviour needs to be
fully documented..

 How about we discuss RSS 1st? for RSS no wire change is introduced,
 lets see if/how we can come to an agreement how the RSS related verbs
 should look like and we'll take it from there to TSS.

Well, to me, TSS is pretty simple. RSS is where things got really
complicated..

As Sean said earlier, please think about a single QP, multiple RQ/SQ
style API - that seems much more general to me and also could
reasonably be defined for other transport types.

For instance, someday supporting multiple RQ on a RC transport, with
content-based steering, is a limited form of tag matching.. From a
longer-term user space API design standpoint the concept seems to have
more longevity.

Also, I feel what happens inside the kernel is more flexable API
wise, so dropping the uverbs component may also be something you want
to look at.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-23 Thread Or Gerlitz
On Mon, Apr 22, 2013 at 7:46 PM, Or Gerlitz or.gerl...@gmail.com wrote:
 Sean, Tzahi -- I understand now that there have been few talkings @
 the OFA meeting re this patch set. So what's the way to move forward,
 any concrete feedback that needs to be addressed here?  This series is
 hanging since May 2012 and I'd like to see it gets in for 3.10, now if
 indeed Sean is OK with the general framework, please suggest.

Sean,

I understand that following some conversations help at the OFA
meetings you kind of took back the concerns you raised regarding the
concept of the verbs level QP group which is used by this series to
implement RSS and TSS, can you acknoledge that?

Roland, this series is been around for about a year now, any feedback
or comments from your side that we need to address for it to get
accepted?

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-23 Thread Hefty, Sean
 On Mon, Apr 22, 2013 at 7:46 PM, Or Gerlitz or.gerl...@gmail.com wrote:
  Sean, Tzahi -- I understand now that there have been few talkings @
  the OFA meeting re this patch set. So what's the way to move forward,
  any concrete feedback that needs to be addressed here?  This series is
  hanging since May 2012 and I'd like to see it gets in for 3.10, now if
  indeed Sean is OK with the general framework, please suggest.
 
 Sean,
 
 I understand that following some conversations help at the OFA
 meetings you kind of took back the concerns you raised regarding the
 concept of the verbs level QP group which is used by this series to
 implement RSS and TSS, can you acknoledge that?

No - I agree with the RSS/TSS concept.  That I've never had an issue with.  My 
issue is that the current verbs API appears to be a poor fit.  I don't have a 
good answer for an alternative.

Conceptually, RSS/TSS are a set of send/receive work queues all belonging to 
the same transport level address.  There's no parent-child relationship or 
needed pairing of send and receive queues together in order to form a group.

Personally, I'd like to see a way that better captures the notion of a 'set of 
work queues with the same address'.  For example, it makes more sense to me if 
a user created/destroyed the work queues together, and if the WQs were viewed 
as being in a single state (INIT, RTR, RTS...).

I'm just thinking out loud here, hoping that it spurs ideas, but if we added a 
call like:

struct ib_qp *ib_create_wq_array/set/group(...);

then added the ability to specify which WQ a specific send or receive should be 
posted to, it may do a better job of capturing RSS/TSS concepts, but still make 
use of the existing calls.  (Underneath this, the driver can allocate actual 
QPs  with sequential QPNs or whatever is required, but that's not exposed.)  
Obviously, I haven't thought through specifics.

I'll try to meet up with Diego and Tzahi tonight or tomorrow to discuss this 
further.

- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-22 Thread Or Gerlitz
On Mon, Apr 15, 2013 at 4:21 PM, Or Gerlitz or.gerl...@gmail.com wrote:
 Actually these comments and questions on the series come just a week
 before the annual OFA gathering, personally, I will not be there nor
 Shlomo who is the author of the patches, but Tzahi Oved from Mellanox
 who lead the architecture for the QP group concept is planned to
 attend and same for Sean, Roland and I hope you (Ira) too, same for
 Ali Ayoub and Liran Liss from Mellanox who are attending too, all in
 all, nice quorum to get into a room and do white boarding, open
 discussion, laughing, yelling and what ever needed to get a consensus.
[...]

Sean, Tzahi -- I understand now that there have been few talkings @
the OFA meeting re this patch set. So what's the way to move forward,
any concrete feedback that needs to be addressed here?  This series is
hanging since May 2012 and I'd like to see it gets in for 3.10, now if
indeed Sean is OK with the general framework, please suggest.

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-15 Thread Or Gerlitz
Weiny, Ira ira.we...@intel.com wrote:
 ow...@vger.kernel.org] On Behalf Of Or Gerlitz


 RSS child QPs are plain UD or RAW Packet QPs that only have consecutive
 QPNs which is common requirement of HW for configuring the RSS parent
 which in networking is called the RSS indirection or dispatching QP. You
 can send and receive on them.

 How do you ensure that the QPN's are consecutive?

Quoting from this patch change-log:

start A QP group is a set of QPs consists of a parent QP and two disjoint sets
of RSS and TSS QPs. The creation of a QP group is a two stage process:

In the the 1st stage, the parent QP is created.

In the 2nd stage the children QPs of the parent are created.

Each child QP indicates if its a RSS or TSS QP. Both the TSS
and RSS sets of QPs should have contiguous QP numbers. end

When the parent is created we (the driver) are being told by the
consumer (providing instance of struct  ib_qpg_init_attrib) how many
child QPs they would need, so we can internally act up front and make
sure there's a consecutive chain of QPNs reserved for that group.

 If an RSS child goes to the error state it will not receive data.

 If you transition it back to RTS would it start working again?

YES

 Could you remove it and add a new one?  (I guess not because the new QPN
 would likely not be consecutive.)

NO, its disallowed to destroy any of the child QPs as long as  the
parent is there, quoting from the change log:

start It is forbidden to modify parent QP state before all RSS/TSS children
were created. In the same manner it is disallowed to destroy the parent
QP unless all RSS/TSS children were destroyed. end

 Packets are routed to RSS childs only per the hash function output, not per
 the state of that child.

 So if the QP chosen by the hash is in error state the packets get lost?
 Above you said they would not receive data.

indeed, get lost, which means data will not be received, not sure I am
following what isn't aligned to what I said in that above comment.

Actually these comments and questions on the series come just a week
before the annual OFA gathering, personally, I will not be there nor
Shlomo who is the author of the patches, but Tzahi Oved from Mellanox
who lead the architecture for the QP group concept is planned to
attend and same for Sean, Roland and I hope you (Ira) too, same for
Ali Ayoub and Liran Liss from Mellanox who are attending too, all in
all, nice quorum to get into a room and do white boarding, open
discussion, laughing, yelling and what ever needed to get a consensus.

It would be good if a BOF would be set to discuss the QP groups
concept and how to proceed with getting the verbs layer to support
RSS/TSS so we can finally

1. embed them within the verbs language
2. support MQ/RSS in the IPoIB network driver

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups

2013-04-14 Thread Weiny, Ira
 -Original Message-
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 ow...@vger.kernel.org] On Behalf Of Or Gerlitz
 Sent: Thursday, April 11, 2013 4:09 PM
 To: Hefty, Sean
 Cc: rol...@kernel.org; linux-rdma@vger.kernel.org;
 shlo...@mellanox.com; Tzahi Oved
 Subject: Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups
 
 On Thu, Apr 11, 2013 at 5:45 PM, Hefty, Sean sean.he...@intel.com wrote:
 [...]
  but lets get there after hopefully agreeing what is RSS QP group.
 
  So far, this is what I think it is:
  - a collection of related receive queues
  - each queue is configured somewhat separately - i.e. sized, CQ, sge size,
 etc.
  - receives are posted to the queues separately
  - the queues are allocated together
 
  This is where it gets confusing.  They're allocated together, but
  through separate API calls.
  I'm not sure if they share states or not.  Can all of them but one go
  into the error state  and still receive data?  Do packets get routed to
 whichever queue actually works, or do  packets sometimes get dropped,
 but sometimes don't, depending on some local rules  which have been
 programmed into the HCA?  Can the queues really be destroyed
 independently?
 
 We only require (== implemented that) for the verbs level to mandate for
 the RSS parent not to be destroyed before ANY of the RSS childs is destroyed
 and be placed to action only after ALL RSS childs are created. The queues
 (RSS childs can be destroyed independently after the parent is destroyed,
 yes.
 
 RSS child QPs are plain UD or RAW Packet QPs that only have consecutive
 QPNs which is common requirement of HW for configuring the RSS parent
 which in networking is called the RSS indirection or dispatching QP. You can
 send and receive on them.

How do you ensure that the QPN's are consecutive?

 
 If an RSS child goes to the error state it will not receive data.

If you transition it back to RTS would it start working again?

Could you remove it and add a new one?  (I guess not because the new QPN would 
likely not be consecutive.)

 
 Packets are routed to RSS childs only per the hash function output, not per
 the state of that child.

So if the QP chosen by the hash is in error state the packets get lost?  Above 
you said they would not receive data.

Ira

 
 This design doesn't allow for an app to do DoS attack on the HW either if they
 try that out or just have a bug, but does require them to think out their code
 (design/review/test) - fair enough. RSS exists in almost any Ethernet NIC you
 take from the shelf, and works in the manner I explained here, e.g if one of
 the Ethernet RSS child queues isn't functional packets hashed to it will not 
 be
 received
 
  Is it even necessary to treat the receive queues as being independent?
 What happens  if they're allocated, destroyed, and modified as a single
 entity?
 [...]
 
 can you elaborate/explain the question a little more?
 
 Or.
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in the
 body of a message to majord...@vger.kernel.org More majordomo info at
 http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups

2013-04-11 Thread Hefty, Sean
  I have no issue with RSS/TSS.  But the 'qp group' interface to using this
 seems kludgy.
 
 lets try to be more specific
 
  On a node, this is multiple send/receive queues grouped together to form a
 larger
  construct.  On the wire, this is a single QP - maybe?  I'm still not clear 
  on
 that.  From
  what's written, all the send queues appear as a single QPN.  The receive
 queues
  appear as different QPNs.
 
 Starting with RSS QP groups: its a group made of one parent QP and N
 RSS child QPs.
 
 On the wire everything is sent to the RSS parent QP, however, when the
 HW receives a packet for which this QP/QPN is the destination, it
 applies a hash function on the packet header and subject to the hash
 result dispatches the packet to one of the N child QPs.

I'm not trying to be facetious here, but a QP is a 'pair' of queues - one send, 
one receive.  The construct that you're talking about is a collection of send 
and/or receive queues, all addressed using the same transport level value.  I 
don't see where or why there's a parent-child relationship between queues, 
either from the perspective of the user or the remote endpoint.  The transport 
level address happens to be called a QPN, but that's just a misnomer.

 The design applies for IB UD QPs and Raw Ethernet Packet QP types,
 under IB the QPN of the parent is on the wire, under Eth, there are no
 QPNs on the wire, but that HW has some steering rule which makes
 certain packets to be steered to that RSS parent, and the RSS parent
 in turn further does dispatching decision (hashing) to determine which
 of the child RSS QPs will actually receive that packet.

The current implementation only applies to UD QPs, but I don't see why the 
concept can't apply to UC or RC.
 
In my mind, a QP is really a special case of the RSS/TSS construct.

 With IPoIB, the remote side is provided with the RSS parent QPN as
 part of the IPoIB HW address provided in the ARP reply payload, so
 packets are sent to that QPN. With RAW Packet Eth QPs, the remote side
 isn't aware to QPNs at all, all goes through a steering rule who is
 directing to the RSS parent.
 
 You can send packets over RSS packet QP but not receive packets.
 
 So for RSS, the remote side isn't aware to that QP group @ all.
 
 Makes sense?
 
 As for TSS QP groups, basically  generally speaking, the only case
 that really matters are applications/drivers that care for the source
 QPN of a packet.
 
 but lets get there after hopefully agreeing what is RSS QP group.

So far, this is what I think it is:

- a collection of related receive queues
- each queue is configured somewhat separately - i.e. sized, CQ, sge size, etc.
- receives are posted to the queues separately
- the queues are allocated together

This is where it gets confusing.  They're allocated together, but through 
separate API calls.

I'm not sure if they share states or not.  Can all of them but one go into the 
error state and still receive data?  Do packets get routed to whichever queue 
actually works, or do packets sometimes get dropped, but sometimes don't, 
depending on some local rules which have been programmed into the HCA?  Can the 
queues really be destroyed independently?

Is it even necessary to treat the receive queues as being independent?  What 
happens if they're allocated, destroyed, and modified as a single entity?  We 
don't treat send and receive queues that belong as part of a single queue 
'pair' as having completely separate states.

- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups

2013-04-11 Thread Or Gerlitz
On Thu, Apr 11, 2013 at 5:45 PM, Hefty, Sean sean.he...@intel.com wrote:
[...]
 but lets get there after hopefully agreeing what is RSS QP group.

 So far, this is what I think it is:
 - a collection of related receive queues
 - each queue is configured somewhat separately - i.e. sized, CQ, sge size, 
 etc.
 - receives are posted to the queues separately
 - the queues are allocated together

 This is where it gets confusing.  They're allocated together, but through 
 separate API
 calls.
 I'm not sure if they share states or not.  Can all of them but one go into 
 the error state  and still receive data?  Do packets get routed to whichever 
 queue actually works, or do  packets sometimes get dropped, but sometimes 
 don't, depending on some local rules  which have been programmed into the 
 HCA?  Can the queues really be destroyed
 independently?

We only require (== implemented that) for the verbs level to mandate
for the RSS parent not to be destroyed before ANY of the RSS childs is
destroyed and be placed to action only after ALL RSS childs are
created. The queues (RSS childs can be destroyed independently after
the parent is destroyed, yes.

RSS child QPs are plain UD or RAW Packet QPs that only have
consecutive QPNs which is common requirement of HW for configuring the
RSS parent which in networking is called the RSS indirection or
dispatching QP. You can send and receive on them.

If an RSS child goes to the error state it will not receive data.

Packets are routed to RSS childs only per the hash function output,
not per the state of that child.

This design doesn't allow for an app to do DoS attack on the HW either
if they try that out or just have a bug, but does require them to
think out their code (design/review/test) - fair enough. RSS exists in
almost any Ethernet NIC you take from the shelf, and works in the
manner I explained here, e.g if one of the Ethernet RSS child queues
isn't functional packets hashed to it will not be received

 Is it even necessary to treat the receive queues as being independent?  What 
 happens  if they're allocated, destroyed, and modified as a single entity?
[...]

can you elaborate/explain the question a little more?

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups

2013-04-10 Thread Or Gerlitz
Or Gerlitz or.gerl...@gmail.com wrote:
Hefty, Sean sean.he...@intel.com wrote:
 I have no issue with RSS/TSS.  But the 'qp group' interface to using this 
 seems
 kludgy.

 lets try to be more specific

 On a node, this is multiple send/receive queues grouped together to form a 
 larger
 construct.  On the wire, this is a single QP - maybe?  I'm still not clear 
 on that.  From
 what's written, all the send queues appear as a single QPN.  The receive 
 queues
 appear as different QPNs.

 Starting with RSS QP groups: its a group made of one parent QP and N
 RSS child QPs.

 On the wire everything is sent to the RSS parent QP, however, when the
 HW receives a packet for which this QP/QPN is the destination, it
 applies a hash function on the packet header and subject to the hash
 result dispatches the packet to one of the N child QPs.

 The design applies for IB UD QPs and Raw Ethernet Packet QP types,
 under IB the QPN of the parent is on the wire, under Eth, there are no
 QPNs on the wire, but that HW has some steering rule which makes
 certain packets to be steered to that RSS parent, and the RSS parent
 in turn further does dispatching decision (hashing) to determine which
 of the child RSS QPs will actually receive that packet.

 With IPoIB, the remote side is provided with the RSS parent QPN as
 part of the IPoIB HW address provided in the ARP reply payload, so
 packets are sent to that QPN. With RAW Packet Eth QPs, the remote side
 isn't aware to QPNs at all, all goes through a steering rule who is
 directing to the RSS parent.

 You can send packets over RSS packet QP but not receive packets.

 So for RSS, the remote side isn't aware to that QP group @ all.

 Makes sense?

does it?


 As for TSS QP groups, basically  generally speaking, the only case
 that really matters are applications/drivers that care for the source
 QPN of a packet.

 but lets get there after hopefully agreeing what is RSS QP group.

 Or.


 Signed-off-by: Shlomo Pongratz shlo...@mellanox.com
 ---
  drivers/infiniband/core/uverbs_cmd.c |1 +
  drivers/infiniband/core/verbs.c  |  118 
 ++
  drivers/infiniband/hw/amso1100/c2_provider.c |3 +
  drivers/infiniband/hw/cxgb3/iwch_provider.c  |2 +
  drivers/infiniband/hw/cxgb4/qp.c |3 +
  drivers/infiniband/hw/ehca/ehca_qp.c |3 +
  drivers/infiniband/hw/ipath/ipath_qp.c   |3 +
  drivers/infiniband/hw/mlx4/qp.c  |3 +
  drivers/infiniband/hw/mthca/mthca_provider.c |3 +
  drivers/infiniband/hw/nes/nes_verbs.c|3 +
  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |5 +
  drivers/infiniband/hw/qib/qib_qp.c   |5 +
  include/rdma/ib_verbs.h  |   40 -
  13 files changed, 190 insertions(+), 2 deletions(-)

 diff --git a/drivers/infiniband/core/uverbs_cmd.c 
 b/drivers/infiniband/core/uverbs_cmd.c
 index a7d00f6..b8f2dff 100644
 --- a/drivers/infiniband/core/uverbs_cmd.c
 +++ b/drivers/infiniband/core/uverbs_cmd.c
 @@ -1581,6 +1581,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file 
 *file,
 attr.sq_sig_type   = cmd.sq_sig_all ? IB_SIGNAL_ALL_WR : 
 IB_SIGNAL_REQ_WR;
 attr.qp_type   = cmd.qp_type;
 attr.create_flags  = 0;
 +   attr.qpg_type  = IB_QPG_NONE;

 attr.cap.max_send_wr = cmd.max_send_wr;
 attr.cap.max_recv_wr = cmd.max_recv_wr;
 diff --git a/drivers/infiniband/core/verbs.c 
 b/drivers/infiniband/core/verbs.c
 index a8fdd33..f40f194 100644
 --- a/drivers/infiniband/core/verbs.c
 +++ b/drivers/infiniband/core/verbs.c
 @@ -406,12 +406,98 @@ struct ib_qp *ib_open_qp(struct ib_xrcd *xrcd,
  }
  EXPORT_SYMBOL(ib_open_qp);

 +static int ib_qpg_verify(struct ib_qp_init_attr *qp_init_attr)
 +{
 +   /* RSS/TSS QP group basic validation */
 +   struct ib_qp *parent;
 +   struct ib_qpg_init_attrib *attr;
 +   struct ib_qpg_attr *pattr;
 +
 +   switch (qp_init_attr-qpg_type) {
 +   case IB_QPG_PARENT:
 +   attr = qp_init_attr-parent_attrib;
 +   if (attr-tss_child_count == 1)
 +   return -EINVAL; /* doesn't make sense */
 +   if (attr-rss_child_count == 1)
 +   return -EINVAL; /* doesn't make sense */
 +   if ((attr-tss_child_count == 0) 
 +   (attr-rss_child_count == 0))
 +   /* should be called with IP_QPG_NONE */
 +   return -EINVAL;
 +   break;
 +   case IB_QPG_CHILD_RX:
 +   parent = qp_init_attr-qpg_parent;
 +   if (!parent || parent-qpg_type != IB_QPG_PARENT)
 +   return -EINVAL;
 +   pattr = parent-qpg_attr.parent_attr;
 +   if (!pattr-rss_child_count)
 +   return -EINVAL;
 +   if (atomic_read(pattr-rsscnt) = pattr-rss_child_count)
 +   return -EINVAL;
 +   

Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups

2013-04-09 Thread Or Gerlitz
 This patch introduces the concept of RSS and TSS QP groups which
 allows for implementing them by low level drivers and using it
 by IPoIB and later also by user space ULPs.

 A QP group is a set of QPs consists of a parent QP and two disjoint sets
 of RSS and TSS QPs. The creation of a QP group is a two stage process:

 In the the 1st stage, the parent QP is created.

 In the 2nd stage the children QPs of the parent are created.

 Each child QP indicates if its a RSS or TSS QP. Both the TSS
 and RSS sets of QPs should have contiguous QP numbers.

 It is forbidden to modify parent QP state before all RSS/TSS children
 were created. In the same manner it is disallowed to destroy the parent
 QP unless all RSS/TSS children were destroyed.

 A few new elements/concepts are introduced to support this:

 Three new device capabilities that can be set by the low level driver:

 - IB_DEVICE_QPG which is set to indicate QP groups are supported.

 - IB_DEVICE_UD_RSS which is set to indicate that the device supports
 RSS, that is applying hash function on incoming TCP/UDP/IP packets and
 dispatching them to multiple rings (child QPs).

 - IB_DEVICE_UD_TSS which is set to indicate that the device supports
 HW TSS which means that the HW is capable of over-riding the source
 UD QPN present in sent IB datagram header (DTH) with the parent's QPN.

 Low level drivers not supporting HW TSS, could still support QP groups, such
 as combination is referred as SW TSS. Where in this case, the low level 
 drive
 fills in the qpg_tss_mask_sz field of struct ib_qp_cap returned from
 ib_create_qp. Such that this mask can be used to retrieve the parent QPN from
 incoming packets carrying a child QPN (as of the contiguous QP numbers 
 requirement).

 - max rss table size device attribute, which is the maximal size of the RSS
 indirection table  supported by the device

 - qp group type attribute for qp creation saying whether this is a parent QP
 or rx/tx (rss/tss) child QP or none of the above for non rss/tss QPs.

 - per qp group type, another attribute is added, for parent QPs, the number
 of rx/tx child QPs and for child QPs pointer to the parent.

 - IB_QP_GROUP_RSS attribute mask, which should be used when modifying
 the parent QP state from reset to init


On Tue, Apr 9, 2013 at 8:06 PM, Hefty, Sean sean.he...@intel.com wrote:

 I have no issue with RSS/TSS.  But the 'qp group' interface to using this 
 seems kludgy.

lets try to be more specific

 On a node, this is multiple send/receive queues grouped together to form a 
 larger
 construct.  On the wire, this is a single QP - maybe?  I'm still not clear on 
 that.  From
 what's written, all the send queues appear as a single QPN.  The receive 
 queues
 appear as different QPNs.

Starting with RSS QP groups: its a group made of one parent QP and N
RSS child QPs.

On the wire everything is sent to the RSS parent QP, however, when the
HW receives a packet for which this QP/QPN is the destination, it
applies a hash function on the packet header and subject to the hash
result dispatches the packet to one of the N child QPs.

The design applies for IB UD QPs and Raw Ethernet Packet QP types,
under IB the QPN of the parent is on the wire, under Eth, there are no
QPNs on the wire, but that HW has some steering rule which makes
certain packets to be steered to that RSS parent, and the RSS parent
in turn further does dispatching decision (hashing) to determine which
of the child RSS QPs will actually receive that packet.

With IPoIB, the remote side is provided with the RSS parent QPN as
part of the IPoIB HW address provided in the ARP reply payload, so
packets are sent to that QPN. With RAW Packet Eth QPs, the remote side
isn't aware to QPNs at all, all goes through a steering rule who is
directing to the RSS parent.

You can send packets over RSS packet QP but not receive packets.

So for RSS, the remote side isn't aware to that QP group @ all.

Makes sense?

As for TSS QP groups, basically  generally speaking, the only case
that really matters are applications/drivers that care for the source
QPN of a packet.

but lets get there after hopefully agreeing what is RSS QP group.

Or.


 Signed-off-by: Shlomo Pongratz shlo...@mellanox.com
 ---
  drivers/infiniband/core/uverbs_cmd.c |1 +
  drivers/infiniband/core/verbs.c  |  118 
 ++
  drivers/infiniband/hw/amso1100/c2_provider.c |3 +
  drivers/infiniband/hw/cxgb3/iwch_provider.c  |2 +
  drivers/infiniband/hw/cxgb4/qp.c |3 +
  drivers/infiniband/hw/ehca/ehca_qp.c |3 +
  drivers/infiniband/hw/ipath/ipath_qp.c   |3 +
  drivers/infiniband/hw/mlx4/qp.c  |3 +
  drivers/infiniband/hw/mthca/mthca_provider.c |3 +
  drivers/infiniband/hw/nes/nes_verbs.c|3 +
  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |5 +
  drivers/infiniband/hw/qib/qib_qp.c   |5 +
  include/rdma/ib_verbs.h  |   40