[RFC] Add support for hardware transmit rate limiting queues [WAS: Add support for changing the flow ID of TCP connections]

2014-08-20 Thread Hans Petter Selasky

Hi,

A month has passed since the last e-mail on this topic, and in the 
meanwhile some new patches have been created and tested:


Basically the approach has been changed a little bit:

- The creation of hardware transmit rings has been made independent of 
the TCP stack. This allows firewall applications to forward traffic into 
hardware transmit rings as well, and not only native TCP applications. 
This should be one more reason to get the feature into the kernel.


- A hardware transmit ring basically can have two modes: FIXED-RATE or 
AUTOMATIC-RATE. In the fixed rate mode all traffic is sent at a fixed 
bytes per second rate. In the automatic mode you can configure a time 
after which the TX queue must be empty. The hardware driver uses this to 
configure the actual rate. In automatic mode you can also set an upper 
and lower transmit rate limit.


- The MBUF has got a new field in the packet header: txringid

- IOCTLs for TCP v4 and v6 sockets have been updated to allow setting of 
the txringid field in the mbuf (a usage sketch follows below).


The current patch [see attachment] should be much simpler and less 
intrusive than the previous one.
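To make the intended usage concrete, below is a minimal userland sketch. 
The SIOCSTXRINGID ioctl and struct in_ratectlreq are taken from the 
attached patch, and the SIOC*RATECTL names from later in this thread; the 
1 ms interval, the error handling and the idea of issuing both ioctls on 
the TCP socket itself are illustrative assumptions only:

/*
 * Hedged sketch, not part of the patch: configure a fixed-rate
 * hardware transmit ring and bind a TCP socket's output to it.
 */
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <string.h>

#include <net/if.h>
#include <netinet/in.h>
#include <netinet/in_var.h>

static int
set_fixed_tx_rate(int s, const char *ifname, unsigned int ring_id,
    uint32_t bytes_per_ms)
{
	struct in_ratectlreq req;

	memset(&req, 0, sizeof(req));
	strlcpy(req.ifreq_name, ifname, sizeof(req.ifreq_name));
	req.tx_ring_id = ring_id;
	req.mode = IN_RATECTLREQ_MODE_FIXED;	/* min rate == max rate */
	req.micros_per_interval = 1000;		/* 1 ms accounting interval */
	req.min_bytes_per_interval = bytes_per_ms;
	req.max_bytes_per_interval = bytes_per_ms;

	/* Program the ring; the request ends up in the driver's if_ioctl(). */
	if (ioctl(s, SIOCSRATECTL, &req) == -1)
		return (-1);

	/* Tag the socket so its outgoing mbufs carry this txringid. */
	if (ioctl(s, SIOCSTXRINGID, &ring_id) == -1)
		return (-1);

	return (0);
}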


Any comments?

--HPS
=== sys/net/if.h
===================================================================
--- sys/net/if.h	(revision 270138)
+++ sys/net/if.h	(local)
@@ -239,6 +239,7 @@
 #define	IFCAP_RXCSUM_IPV6	0x200000 /* can offload checksum on IPv6 RX */
 #define	IFCAP_TXCSUM_IPV6	0x400000 /* can offload checksum on IPv6 TX */
 #define	IFCAP_HWSTATS		0x800000 /* manages counters internally */
+#define	IFCAP_HWTXRINGS		0x1000000 /* hardware supports TX rings */
 
 #define IFCAP_HWCSUM_IPV6	(IFCAP_RXCSUM_IPV6 | IFCAP_TXCSUM_IPV6)
 
=== sys/netinet/in.c
===================================================================
--- sys/netinet/in.c	(revision 270138)
+++ sys/netinet/in.c	(local)
@@ -42,6 +42,7 @@
 #include <sys/malloc.h>
 #include <sys/priv.h>
 #include <sys/socket.h>
+#include <sys/socketvar.h>
 #include <sys/jail.h>
 #include <sys/kernel.h>
 #include <sys/proc.h>
@@ -201,9 +202,23 @@
 	struct in_ifaddr *ia;
 	int error;
 
-	if (ifp == NULL)
-		return (EADDRNOTAVAIL);
+	if (ifp == NULL) {
+		struct inpcb *inp;
 
+		switch (cmd) {
+		case SIOCSTXRINGID:
+			inp = sotoinpcb(so);
+			if (inp == NULL)
+				return (EINVAL);
+			INP_WLOCK(inp);
+			inp->inp_txringid = *(unsigned *)data;
+			INP_WUNLOCK(inp);
+			return (0);
+		default:
+			return (EADDRNOTAVAIL);
+		}
+	}
+
 	/*
 	 * Filter out 4 ioctls we implement directly.  Forward the rest
 	 * to specific functions and ifp->if_ioctl().
=== sys/netinet/in_pcb.h
===================================================================
--- sys/netinet/in_pcb.h	(revision 270138)
+++ sys/netinet/in_pcb.h	(local)
@@ -46,6 +46,7 @@
 #ifdef _KERNEL
 #include <sys/lock.h>
 #include <sys/rwlock.h>
+#include <sys/mbuf.h>
 #include <net/vnet.h>
 #include <vm/uma.h>
 #endif
@@ -177,7 +178,8 @@
 	u_char	inp_ip_ttl;		/* (i) time to live proto */
 	u_char	inp_ip_p;		/* (c) protocol proto */
 	u_char	inp_ip_minttl;		/* (i) minimum TTL or drop */
-	uint32_t inp_flowid;		/* (x) flow id / queue id */
+	m_flowid_t inp_flowid;		/* (x) flow ID */
+	m_txringid_t inp_txringid;		/* (x) transmit ring ID */
 	u_int	inp_refcount;		/* (i) refcount */
 	void	*inp_pspare[5];		/* (x) route caching / general use */
 	uint32_t inp_flowtype;		/* (x) M_HASHTYPE value */
=== sys/netinet/in_var.h
===================================================================
--- sys/netinet/in_var.h	(revision 270138)
+++ sys/netinet/in_var.h	(local)
@@ -33,6 +33,7 @@
 #ifndef _NETINET_IN_VAR_H_
 #define _NETINET_IN_VAR_H_
 
+#include <sys/mbuf.h>
 #include <sys/queue.h>
 #include <sys/fnv_hash.h>
 #include <sys/tree.h>
@@ -81,6 +82,18 @@
 	struct	sockaddr_in ifra_mask;
 	int	ifra_vhid;
 };
+
+struct in_ratectlreq {
+	char		ifreq_name[IFNAMSIZ];
+	m_txringid_t	tx_ring_id;
+	uint32_t	min_bytes_per_interval;
+	uint32_t	max_bytes_per_interval;
+	uint32_t	micros_per_interval;
+	uint32_t	mode;
+#define	IN_RATECTLREQ_MODE_FIXED 0	/* min rate = max rate */
+#define	IN_RATECTLREQ_MODE_AUTOMATIC 1	/* bounded by min/max */
+};
+
 /*
  * Given a pointer to an in_ifaddr (ifaddr),
  * return a pointer to the addr as a sockaddr_in.
=== sys/netinet/ip_output.c
===================================================================
--- sys/netinet/ip_output.c	(revision 270138)
+++ sys/netinet/ip_output.c	(local)
@@ -145,6 +145,7 @@
 	if (inp != NULL) {
 		INP_LOCK_ASSERT(inp);
 		M_SETFIB(m, inp->inp_inc.inc_fibnum);
+		m->m_pkthdr.txringid = inp->inp_txringid;
 		if (inp->inp_flags & (INP_HW_FLOWID|INP_SW_FLOWID)) {
 			m->m_pkthdr.flowid = inp->inp_flowid;
 			M_HASHTYPE_SET(m, inp->inp_flowtype);
=== sys/netinet6/in6.c
===================================================================
--- sys/netinet6/in6.c	(revision 270138)
+++ sys/netinet6/in6.c	(local)
@@ -235,6 +235,23 @@
 	int error;
 	u_long ocmd = cmd;
 
+	if (ifp == NULL) {
+		struct inpcb *inp;
+
+		switch (cmd) {
+		case SIOCSTXRINGID:
+			inp = sotoinpcb(so);
+			if (inp == NULL)
+			

Re: [RFC] Add support for hardware transmit rate limiting queues [WAS: Add support for changing the flow ID of TCP connections]

2014-08-20 Thread Luigi Rizzo
On Wed, Aug 20, 2014 at 9:34 AM, Hans Petter Selasky h...@selasky.org
wrote:

 Hi,

 A month has passed since the last e-mail on this topic, and in the
 meanwhile some new patches have been created and tested:

 Basically the approach has been changed a little bit:

 - The creation of hardware transmit rings has been made independent of the
 TCP stack. This allows firewall applications to forward traffic into
 hardware transmit rings as well, and not only native TCP applications. This
 should be one more reason to get the feature into the kernel.

 - A hardware transmit ring basically can have two modes: FIXED-RATE or
 AUTOMATIC-RATE. In the fixed rate mode all traffic is sent at a fixed bytes
 per second rate. In the automatic mode you can configure a time after which
 the TX queue must be empty. The hardware driver uses this to configure the
 actual rate. In automatic mode you can also set an upper and lower transmit
 rate limit.

 - The MBUF has got a new field in the packet header: txringid

 - IOCTLs for TCP v4 and v6 sockets have been updated to allow setting of
 the txringid field in the mbuf.

 The current patch [see attachment] should be much simpler and less
 intrusive than the previous one.


The patch seems to include only part of the generic code (i.e. no ioctls
for manipulating the rates, no backend code). Am I missing something?

I have a few comments/concerns:

+ looks like flowid and txringid are overlapping in scope,
  both will be used (in the backend) to select a specific
  tx queue. I don't have a solution but would like to know
  how you plan to address this -- does one have priority
  over the other, etc.
+ related to the above, a (possibly unavoidable) side effect
  of this type of change is that mbufs explode with custom fields,
  so if we could perhaps merge flowid and txringid into one,
  that would be useful.

+ is there a way to avoid the replicated code for SIOCSTXRINGID
  (the ioctl handler, I suppose)? Maybe make one function and
  call it from both IPv4 and IPv6, assuming there aren't other
  places like this.

+ I am not particularly happy about the explosion of ioctls for
  setting and getting rates. Next we'll want to add scheduling,
  and intervals, and queue sizes and so on.
  For these commands outside the critical path, a single command
  with an extensible structure would be preferable.
  Bikeshed material, I am sure.

cheers
luigi

Re: [RFC] Add support for hardware transmit rate limiting queues [WAS: Add support for changing the flow ID of TCP connections]

2014-08-20 Thread Hans Petter Selasky

Hi Luigi,

On 08/20/14 11:32, Luigi Rizzo wrote:

On Wed, Aug 20, 2014 at 9:34 AM, Hans Petter Selasky h...@selasky.org
wrote:


Hi,

A month has passed since the last e-mail on this topic, and in the
meanwhile some new patches have been created and tested:

Basically the approach has been changed a little bit:

- The creation of hardware transmit rings has been made independent of the
TCP stack. This allows firewall applications to forward traffic into
hardware transmit rings as well, and not only native TCP applications. This
should be one more reason to get the feature into the kernel.

- A hardware transmit ring basically can have two modes: FIXED-RATE or
AUTOMATIC-RATE. In the fixed rate mode all traffic is sent at a fixed bytes
per second rate. In the automatic mode you can configure a time after which
the TX queue must be empty. The hardware driver uses this to configure the
actual rate. In automatic mode you can also set an upper and lower transmit
rate limit.

- The MBUF has got a new field in the packet header: txringid

- IOCTLs for TCP v4 and v6 sockets have been updated to allow setting of
the txringid field in the mbuf.

The current patch [see attachment] should be much simpler and less
intrusive than the previous one.



The patch seems to include only part of the generic code (i.e. no ioctls
for manipulating the rates, no backend code). Am I missing something?


The IOCTLs for managing the rates are:

SIOCARATECTL, SIOCSRATECTL, SIOCGRATECTL and SIOCDRATECTL

And they go to the if_ioctl callback.
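For illustration, a hypothetical driver-side dispatch of SIOCSRATECTL in 
an if_ioctl callback could look like the sketch below; the mlx_* names, 
the softc fields and the validation rules are invented here and are not 
part of the patch:

/*
 * Hypothetical if_ioctl handling of SIOCSRATECTL; only a sketch.
 */
static int
mlx_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
{
	struct mlx_softc *sc = ifp->if_softc;
	struct in_ratectlreq *req = (struct in_ratectlreq *)data;
	int error = 0;

	switch (cmd) {
	case SIOCSRATECTL:
		if (req->tx_ring_id >= sc->num_tx_rings)
			return (EINVAL);
		if (req->mode == IN_RATECTLREQ_MODE_FIXED &&
		    req->min_bytes_per_interval != req->max_bytes_per_interval)
			return (EINVAL);
		/* Program the per-ring rate limiter in hardware. */
		error = mlx_set_ring_rate(sc, req);
		break;
	default:
		error = ether_ioctl(ifp, cmd, data);
		break;
	}
	return (error);
}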



I have a few comments/concerns:

+ looks like flowid and txringid are overlapping in scope,
   both will be used (in the backend) to select a specific
   tx queue. I don't have a solution but would like to know
   how you plan to address this -- does one have priority
   over the other, etc.


Not 100%. In some cases the flowID is used differently from the 
txringid, though it might be possible to join the two. We would need to 
investigate the current users of the flow ID.



+ related to the above, a (possibly unavoidable) side effect
   of this type of change is that mbufs explode with custom fields,
   so if we could perhaps merge flowid and txringid into one,
   that would be useful.


Right, but rate control is a generally useful feature, especially for 
high-throughput networks, or do you think otherwise?




+ is there a way to avoid the replicated code for SIOCSTXRINGID
   (the ioctl handler, I suppose)? Maybe make one function and
   call it from both IPv4 and IPv6, assuming there aren't other
   places like this.


Yes, could do that.



+ I am not particularly happy about the explosion of ioctls for
   setting and getting rates. Next we'll want to add scheduling,
   and intervals, and queue sizes and so on.
   For these commands outside the critical path, a single command
   with an extensible structure would be preferable.
   Bikeshed material, I am sure.


There is only one IOCTL in the critical path and that is the IOCTL to 
change or update the TX ring ID. The other IOCTLs are in the 
non-critical path towards the if_ioctl() callback.


If we can merge the flowID and the txringid into one field, would it be 
acceptable to add an IOCTL to read/write this value for all sockets?


--HPS


Re: [RFC] Add support for hardware transmit rate limiting queues [WAS: Add support for changing the flow ID of TCP connections]

2014-08-20 Thread Luigi Rizzo
On Wed, Aug 20, 2014 at 3:29 PM, Hans Petter Selasky h...@selasky.org
wrote:

 Hi Luigi,


 On 08/20/14 11:32, Luigi Rizzo wrote:

 On Wed, Aug 20, 2014 at 9:34 AM, Hans Petter Selasky h...@selasky.org
 wrote:

  Hi,

 A month has passed since the last e-mail on this topic, and in the
 meanwhile some new patches have been created and tested:

 Basically the approach has been changed a little bit:

 - The creation of hardware transmit rings has been made independent of
 the
 TCP stack. This allows firewall applications to forward traffic into
 hardware transmit rings as well, and not only native TCP applications.
 This
 should be one more reason to get the feature into the kernel.
 ...

 The patch seems to include only part of the generic code (i.e. no ioctls
 for manipulating the rates, no backend code). Am I missing something?


 The IOCTLs for managing the rates are:

 SIOCARATECTL, SIOCSRATECTL, SIOCGRATECTL and SIOCDRATECTL

 And they go to the if_ioctl callback.


I really think these new 'advanced' features should go
through some ethtool-like API, not more ioctls.
We have a strong need to design and implement such
an API, also to have a uniform mechanism to manipulate
RSS, queues and other NIC features.


...



 I have a few comments/concerns:

 + looks like flowid and txringid are overlapping in scope,
    both will be used (in the backend) to select a specific
    tx queue. I don't have a solution but would like to know
    how you plan to address this -- does one have priority
    over the other, etc.


 Not 100%. In some cases the flowID is used differently from the txringid,
 though it might be possible to join the two. We would need to investigate
 the current users of the flow ID.


In some 10G drivers I have seen, at the driver
level the flowid is used on the tx path to assign
packets to a given tx queue, generally to improve
CPU affinity. Of course some applications
may want a true flow classifier so they do not
have to re-do the classification multiple times.
But then, we have a ton of different classifiers
with the same need -- e.g. ipfw dynamic rules,
dummynet pipe/queue id, divert ports...
Pipes are stored in mtags, which are very expensive,
so I do see a point in embedding them in the mbufs;
it's just that going down this path there is no end
to the list.
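For concreteness, the kind of txq selection described above might look 
like the sketch below, assuming a priority rule where an explicit 
txringid wins over the flowid; the mlx_* names, the "ring 0 means unset" 
convention and the fallback policy are all assumptions:

/*
 * Sketch of txq selection in an if_transmit routine, arbitrating
 * between the new txringid and the existing flowid.
 */
static int
mlx_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct mlx_softc *sc = ifp->if_softc;
	u_int qidx;

	if (m->m_pkthdr.txringid != 0) {
		/* Application requested a specific (rate-limited) ring. */
		qidx = m->m_pkthdr.txringid % sc->num_tx_rings;
	} else if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) {
		/* Spread flows across rings for CPU affinity. */
		qidx = m->m_pkthdr.flowid % sc->num_tx_rings;
	} else {
		qidx = curcpu % sc->num_tx_rings;
	}
	return (mlx_ring_enqueue(sc, qidx, m));
}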




  + related to the above, a (possibly unavoidable) side effect
    of this type of change is that mbufs explode with custom fields,
    so if we could perhaps merge flowid and txringid into one,
    that would be useful.


 Right, but rate control is a generally useful feature, especially for
 high-throughput networks, or do you think otherwise?


Of course I think the feature is useful,
but see the previous point. We should find a
way to manage it (and others) that does not pollute
or require continuous changes to the struct mbuf.




  + I am not particularly happy about the explosion of ioctls for
    setting and getting rates. Next we'll want to add scheduling,
    and intervals, and queue sizes and so on.
    For these commands outside the critical path, a single command
    with an extensible structure would be preferable.
    Bikeshed material, I am sure.


 There is only one IOCTL in the critical path and that is the IOCTL to
 change or update the TX ring ID. The other IOCTLs are in the non-critical
 path towards the if_ioctl() callback.

 If we can merge the flowID and the txringid into one field, would it be
 acceptable to add an IOCTL to read/write this value for all sockets?


See above. I'd prefer an ethtool-like solution.

cheers
luigi

Re: [RFC] Add support for hardware transmit rate limiting queues [WAS: Add support for changing the flow ID of TCP connections]

2014-08-20 Thread K. Macy
On Wed, Aug 20, 2014 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote:

 On Wed, Aug 20, 2014 at 3:29 PM, Hans Petter Selasky h...@selasky.org
 wrote:

  Hi Luigi,
 
 
  On 08/20/14 11:32, Luigi Rizzo wrote:
 
  On Wed, Aug 20, 2014 at 9:34 AM, Hans Petter Selasky h...@selasky.org
  wrote:
 
   Hi,
 
  A month has passed since the last e-mail on this topic, and in the
  meanwhile some new patches have been created and tested:
 
  Basically the approach has been changed a little bit:
 
  - The creation of hardware transmit rings has been made independent of
  the
  TCP stack. This allows firewall applications to forward traffic into
  hardware transmit rings as well, and not only native TCP applications.
  This
  should be one more reason to get the feature into the kernel.
  ...
 
  The patch seems to include only part of the generic code (i.e. no ioctls
  for manipulating the rates, no backend code). Am I missing something?
 
 
  The IOCTLs for managing the rates are:
 
  SIOCARATECTL, SIOCSRATECTL, SIOCGRATECTL and SIOCDRATECTL
 
  And they go to the if_ioctl callback.


 I really think these new 'advanced' features should go
 through some ethtool-like API, not more ioctls.
 We have a strong need to design and implement such
 an API, also to have a uniform mechanism to manipulate
 RSS, queues and other NIC features.




There is no ethtool equivalent yet, but exposing them through a sysctl is
definitely the place to start before putting it straight into ifconfig.
The ifnet API is already a bit of a mess.



 ...
 
 
 
  I have a few comments/concerns:
 
  + looks like flowid and txringid are overlapping in scope,
 both will be used (in the backend) to select a specific
 tx queue. I don't have a solution but would like to know
 how you plan to address this -- does one have priority
 over the other, etc.
 
 
  Not 100%. In some cases the flowID is used differently from the
  txringid, though it might be possible to join the two. We would need
  to investigate the current users of the flow ID.


 In some 10G drivers I have seen, at the driver
 level the flowid is used on the tx path to assign
 packets to a given tx queue, generally to improve
 CPU affinity. Of course some applications
 may want a true flow classifier so they do not
 have to re-do the classification multiple times.
 But then, we have a ton of different classifiers
 with the same need -- e.g. ipfw dynamic rules,
 dummynet pipe/queue id, divert ports...
 Pipes are stored in mtags, which are very expensive,
 so I do see a point in embedding them in the mbufs;
 it's just that going down this path there is no end
 to the list.



The purpose of the flowid was to enforce packet ordering on transmit while
being large enough to store an RSS hash, potentially allowing input
consumers to use it to semi-uniquely label (srcip, srcport, dstip, dstport)
tuples. It seems to me that the txringid would be almost entirely redundant.
Why not just let users set the flowid?
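Patterned on the patch's own SIOCSTXRINGID handler in in_control(), 
letting the user set the flowid directly could look roughly like the 
fragment below; SIOCSFLOWID is an invented name, and marking the hash 
type opaque mirrors the "new rsstype" idea raised elsewhere in the 
thread:

		case SIOCSFLOWID:	/* hypothetical ioctl */
			inp = sotoinpcb(so);
			if (inp == NULL)
				return (EINVAL);
			INP_WLOCK(inp);
			inp->inp_flowid = *(uint32_t *)data;
			/* Opaque hash type: the stack must not rewrite it. */
			inp->inp_flowtype = M_HASHTYPE_OPAQUE;
			inp->inp_flags |= INP_SW_FLOWID;
			INP_WUNLOCK(inp);
			return (0);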


 If we can merge the flowID and the txringid into one field, would it be
  acceptable to add an IOCTL to read/write this value for all sockets?


That sounds reasonable - although I have not thought through all the
implications.


-K

Re: [RFC] Add support for hardware transmit rate limiting queues [WAS: Add support for changing the flow ID of TCP connections]

2014-08-20 Thread Navdeep Parhar

On 08/20/14 00:34, Hans Petter Selasky wrote:

Hi,

A month has passed since the last e-mail on this topic, and in the
meanwhile some new patches have been created and tested:

Basically the approach has been changed a little bit:

- The creation of hardware transmit rings has been made independent of
the TCP stack. This allows firewall applications to forward traffic into
hardware transmit rings as well, and not only native TCP applications.
This should be one more reason to get the feature into the kernel.

- A hardware transmit ring basically can have two modes: FIXED-RATE or
AUTOMATIC-RATE. In the fixed rate mode all traffic is sent at a fixed
bytes per second rate. In the automatic mode you can configure a time
after which the TX queue must be empty. The hardware driver uses this to
configure the actual rate. In automatic mode you can also set an upper
and lower transmit rate limit.

- The MBUF has got a new field in the packet header: txringid

- IOCTLs for TCP v4 and v6 sockets have been updated to allow setting of
the txringid field in the mbuf.

The current patch [see attachment] should be much simpler and less
intrusive than the previous one.

Any comments?



Here are some thoughts.  The first two bullets cover relatively
minor issues, the rest are more important.

- All of the mbuf pkthdr fields today have the same meaning no matter
  what the context.  It is not clear what txringid's global meaning is.
  Is it even possible for driver foo to interpret it the same way as
  driver bar?  What if the numbers of rings are different, or if the ring
  at the particular index for foo is set up differently than the ring at
  that same index for bar?  You are attempting to influence the driver's
  txq selection, and traditionally the mbuf's flowid has been used for
  this purpose.  Have you considered allowing the user to set the flowid
  directly, and marking it as such via a new rsstype so the kernel will
  leave it alone?

- uint32_t -> m_flowid_t is plain gratuitous.  Now we need to include
  mbuf.h in more places just to get this definition.  What's the
  advantage of this?  style(9) isn't too fond of typedefs either.  Also,
  drivers *do* need to know the width of the flowid.  At least lagg(4)
  looks at the high bits of the flowid (see flowid_shift in lagg, and
  the sketch after this list).  How high it can go depends on the width
  of the flowid.

- Interfaces can come and go, routes can change, and so the relationship
  between an inpcb and txringid is not stable at all.  What happens when
  the outbound route for an inpcb changes?

- The in_ratectlreq structure that you propose is inadequate in its
  current form.  For example, cxgbe's hardware can do rate limiting on a
  per-ring as well as per-connection basis, and it allows for pps,
  bandwidth, or min-max limits.  I think this is the critical piece that
  we NIC maintainers must agree on before any code hits the core kernel:
  how to express a rate-limit policy in a standard way and allow for
  hardware assistance opportunistically.  ipfw(4)'s dummynet is probably
  interested in this part too, so it's great that Luigi is paying
  attention to this thread.

- The RATECTL ioctls deal with in_ratectlreq so we need to standardize
  the ratectlreq structure before these ioctls can be considered generic
  ifnet ioctls.  This is the reason cxgbetool (and not ifconfig) has a
  private ioctl to frob cxgbe's per-queue rate-limiters.  I did not want
  to add ifnet ioctls that in reality were cxgbe only.  Ditto for i2c
  ioctls.  Now we have multiple drivers with i2c and melifaro@ is doing
  the right thing by promoting these private ioctls to a standard ifnet
  ioctl.  Have you considered a private mlxtool as a stop gap measure?
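For reference, the lagg(4) behaviour mentioned in the second bullet boils 
down to a port selection like the sketch below, simplified from 
if_lagg.c; lb_port_at() is a hypothetical stand-in for the real port 
array lookup:

/*
 * Why flowid width matters: lagg consumes the high bits via
 * flowid_shift when picking an egress port (simplified).
 */
static struct lagg_port *
lb_select_port(struct lagg_softc *sc, struct mbuf *m)
{
	uint32_t p;

	p = m->m_pkthdr.flowid >> sc->flowid_shift;
	p %= sc->sc_count;		/* number of ports in the lagg */
	return (lb_port_at(sc, p));	/* hypothetical accessor */
}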

To summarize my take on all of this: we need a standard ratectlreq
structure, a standard way to associate an inpcb with one, and a standard
way to pass on this info to if_transmit.  After all this is in place we
could even have a dummynet-ish software layer that implements rate
limiters when the underlying hardware offers no assistance.
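To make the "standard ratectlreq" point concrete, one strawman for a more 
expressive, extensible request is sketched below; every name in it is 
invented for discussion, and it covers the per-ring/per-connection scopes 
and pps/bandwidth limit types mentioned above:

/*
 * Strawman only: a versioned, extensible rate-limit request.
 */
struct if_ratectlreq {
	uint32_t	version;	/* lets the structure grow */
	uint32_t	scope;
#define	IF_RATECTL_SCOPE_RING		0	/* per transmit ring */
#define	IF_RATECTL_SCOPE_CONNECTION	1	/* per connection/flow */
	uint32_t	limit_type;
#define	IF_RATECTL_LIMIT_PPS		0	/* packets per second */
#define	IF_RATECTL_LIMIT_BPS		1	/* bytes per second */
	uint64_t	min_rate;	/* min == max gives a fixed rate */
	uint64_t	max_rate;
	uint32_t	queue_index;	/* valid for SCOPE_RING */
	uint32_t	spare[7];	/* room for scheduling knobs, etc. */
};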

Regards,
Navdeep


Re: [RFC] Add support for hardware transmit rate limiting queues [WAS: Add support for changing the flow ID of TCP connections]

2014-08-20 Thread Hans Petter Selasky

On 08/20/14 20:44, Navdeep Parhar wrote:

On 08/20/14 00:34, Hans Petter Selasky wrote:

Hi,

A month has passed since the last e-mail on this topic, and in the
meanwhile some new patches have been created and tested:

Basically the approach has been changed a little bit:

- The creation of hardware transmit rings has been made independent of
the TCP stack. This allows firewall applications to forward traffic into
hardware transmit rings as well, and not only native TCP applications.
This should be one more reason to get the feature into the kernel.

- A hardware transmit ring basically can have two modes: FIXED-RATE or
AUTOMATIC-RATE. In the fixed rate mode all traffic is sent at a fixed
bytes per second rate. In the automatic mode you can configure a time
after which the TX queue must be empty. The hardware driver uses this to
configure the actual rate. In automatic mode you can also set an upper
and lower transmit rate limit.

- The MBUF has got a new field in the packet header: txringid

- IOCTLs for TCP v4 and v6 sockets have been updated to allow setting of
the txringid field in the mbuf.

The current patch [see attachment] should be much simpler and less
intrusive than the previous one.

Any comments?



Here are some thoughts.  The first two bullets cover relatively
minor issues, the rest are more important.

- All of the mbuf pkthdr fields today have the same meaning no matter
   what the context.  It is not clear what txringid's global meaning is.
   Is it even possible for driver foo to interpret it the same way as
   driver bar?  What if the numbers of rings are different, or if the ring
   at the particular index for foo is set up differently than the ring at
   that same index for bar?  You are attempting to influence the driver's
   txq selection, and traditionally the mbuf's flowid has been used for
   this purpose.  Have you considered allowing the user to set the flowid
   directly, and marking it as such via a new rsstype so the kernel will
   leave it alone?


Hi,

At work, so to speak, we have tried to make a simple approach that will 
not break existing code, without yet trying to optimise the possibilities 
or reduce the memory footprint.




- uint32_t -> m_flowid_t is plain gratuitous.  Now we need to include
   mbuf.h in more places just to get this definition.  What's the
   advantage of this?  style(9) isn't too fond of typedefs either.  Also,
   drivers *do* need to know the width of the flowid.  At least lagg(4)
   looks at the high bits of the flowid (see flowid_shift in lagg).  How
   high it can go depends on the width of the flowid.


The flowid should be typedef'ed. Otherwise, how can you know its type when 
passing the flowid along function arguments and so on?




- Interfaces can come and go, routes can change, and so the relationship
   between an inpcb and txringid is not stable at all.  What happens when
   the outbound route for an inpcb changes?


This is managed separately by a daemon or such. The problem with the 
inpcb approach you are suggesting is that it limits the rate control 
feature to traffic which is bound to sockets. Can your way of doing rate 
control be useful to non-socket based firewall 
applications, for example?


You also assume a 1:1 mapping between inpcb and the flowID, right? 
What about M:N mappings, where multiple streams should share the same 
flowID, because it makes more sense?




- The in_ratectlreq structure that you propose is inadequate in its
   current form.  For example, cxgbe's hardware can do rate limiting on a
   per-ring as well as per-connection basis, and it allows for pps,
   bandwidth, or min-max limits.  I think this is the critical piece that
   we NIC maintainers must agree on before any code hits the core kernel:
   how to express a rate-limit policy in a standard way and allow for
   hardware assistance opportunistically.  ipfw(4)'s dummynet is probably
   interested in this part too, so it's great that Luigi is paying
   attention to this thread.


My in_ratectlreq is a work in progress.



- The RATECTL ioctls deal with in_ratectlreq so we need to standardize
   the ratectlreq structure before these ioctls can be considered generic
   ifnet ioctls.  This is the reason cxgbetool (and not ifconfig) has a
   private ioctl to frob cxgbe's per-queue rate-limiters.  I did not want
   to add ifnet ioctls that in reality were cxgbe only.  Ditto for i2c
   ioctls.  Now we have multiple drivers with i2c and melifaro@ is doing
   the right thing by promoting these private ioctls to a standard ifnet
   ioctl.  Have you considered a private mlxtool as a stop gap measure?


It might end up that we need to create our own tool for this, with vendor 
specific IOCTLs, if we cannot agree on how to do this in a general way.




To summarize my take on all of this: we need a standard ratectlreq
structure,


Agree.


a standard way to associate an inpcb with one,


Maybe.


and a standard
way to pass on this info to if_transmit.


Agree.


After all 

Re: [RFC] Add support for hardware transmit rate limiting queues [WAS: Add support for changing the flow ID of TCP connections]

2014-08-20 Thread Navdeep Parhar

On 08/20/14 12:25, Hans Petter Selasky wrote:

On 08/20/14 20:44, Navdeep Parhar wrote:

On 08/20/14 00:34, Hans Petter Selasky wrote:

Hi,

A month has passed since the last e-mail on this topic, and in the
meanwhile some new patches have been created and tested:

Basically the approach has been changed a little bit:

- The creation of hardware transmit rings has been made independent of
the TCP stack. This allows firewall applications to forward traffic into
hardware transmit rings as well, and not only native TCP applications.
This should be one more reason to get the feature into the kernel.

- A hardware transmit ring basically can have two modes: FIXED-RATE or
AUTOMATIC-RATE. In the fixed rate mode all traffic is sent at a fixed
bytes per second rate. In the automatic mode you can configure a time
after which the TX queue must be empty. The hardware driver uses this to
configure the actual rate. In automatic mode you can also set an upper
and lower transmit rate limit.

- The MBUF has got a new field in the packet header: txringid

- IOCTLs for TCP v4 and v6 sockets have been updated to allow setting of
the txringid field in the mbuf.

The current patch [see attachment] should be much simpler and less
intrusive than the previous one.

Any comments?



Here are some thoughts.  The first two bullets cover relatively
minor issues, the rest are more important.

- All of the mbuf pkthdr fields today have the same meaning no matter
    what the context.  It is not clear what txringid's global meaning is.
    Is it even possible for driver foo to interpret it the same way as
    driver bar?  What if the numbers of rings are different, or if the ring
    at the particular index for foo is set up differently than the ring at
    that same index for bar?  You are attempting to influence the driver's
    txq selection, and traditionally the mbuf's flowid has been used for
    this purpose.  Have you considered allowing the user to set the flowid
    directly, and marking it as such via a new rsstype so the kernel will
    leave it alone?


Hi,

At work, so to speak, we have tried to make a simple approach that will
not break existing code, without yet trying to optimise the possibilities
or reduce the memory footprint.



- uint32_t -> m_flowid_t is plain gratuitous.  Now we need to include
   mbuf.h in more places just to get this definition.  What's the
   advantage of this?  style(9) isn't too fond of typedefs either.  Also,
   drivers *do* need to know the width of the flowid.  At least lagg(4)
   looks at the high bits of the flowid (see flowid_shift in lagg).  How
   high it can go depends on the width of the flowid.


The flowid should be typedef'ed. Otherwise, how can you know its type when
passing the flowid along function arguments and so on?


It's just a simple 32-bit unsigned int and all drivers know exactly what
it is.  I don't think we need type checking for trivial stuff like this.
We trust code to do the right thing and that's the correct tradeoff
here, in my opinion.  Or else we'd end up with errno_t, fd_t, etc. and
programming in C would not be fun anymore.  Here's a hyperbolic example:

errno_t socket(domain_t domain, socktype_t type, protocol_t protocol);

(oops, it returns an int -1 or 0 so errno_t is not strictly correct, but
you get my point).





- Interfaces can come and go, routes can change, and so the relationship
   between an inpcb and txringid is not stable at all.  What happens when
   the outbound route for an inpcb changes?


This is managed separately by a daemon or such. The problem with the
inpcb approach you are suggesting is that it limits the rate control
feature to traffic which is bound to sockets. Can your way of doing rate
control be useful to non-socket based firewall
applications, for example?

You also assume a 1:1 mapping between inpcb and the flowID, right?
What about M:N mappings, where multiple streams should share the same
flowID, because it makes more sense?


You're right that an inpcb based scheme won't work for a non-socket based
firewall.

inpcb represents an endpoint, almost always with an associated socket,
and it mostly has a 1:1 relation with an n-tuple (SO_LISTEN and UDP
sockets with no default destination are notable exceptions).  If you're
talking of non-socket based firewalls, then where is the inpcb coming
from?  Firewalls typically keep their own state for the n-tuples that
they are interested in.  It almost seems like you need an n-tuple ->
rate_limit mapping scheme instead of inpcb -> rate_limit.
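A rough sketch of such an n-tuple keyed table is below; the structures, 
sizes and hash layout are illustrative assumptions (fnv_32_buf() comes 
from sys/fnv_hash.h, which in_var.h already pulls in):

/*
 * Illustrative n-tuple -> rate_limit mapping for the firewall case.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/queue.h>
#include <sys/fnv_hash.h>
#include <netinet/in.h>

#define	RATELIMIT_BUCKETS	256	/* assumption: power of two */
#define	RATELIMIT_MASK		(RATELIMIT_BUCKETS - 1)

struct ratelimit_key {
	struct in_addr	src, dst;
	uint16_t	sport, dport;
};

struct ratelimit_entry {
	LIST_ENTRY(ratelimit_entry) link;
	struct ratelimit_key	key;
	uint32_t		tx_ring_id;	/* hardware ring to use */
};

LIST_HEAD(ratelimit_head, ratelimit_entry);

/* Find the rate-limited ring for a forwarded packet's 4-tuple, if any. */
static struct ratelimit_entry *
ratelimit_lookup(struct ratelimit_head *tbl, const struct ratelimit_key *k)
{
	struct ratelimit_entry *re;
	uint32_t h;

	h = fnv_32_buf(k, sizeof(*k), FNV1_32_INIT) & RATELIMIT_MASK;
	LIST_FOREACH(re, &tbl[h], link)
		if (memcmp(&re->key, k, sizeof(*k)) == 0)
			return (re);
	return (NULL);
}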

Regards,
Navdeep





- The in_ratectlreq structure that you propose is inadequate in its
   current form.  For example, cxgbe's hardware can do rate limiting on a
   per-ring as well as per-connection basis, and it allows for pps,
   bandwidth, or min-max limits.  I think this is the critical piece that
   we NIC maintainers must agree on before any code hits the core kernel:
   how to express a rate-limit policy in a standard way and allow for
   hardware assistance 

Re: [RFC] Add support for hardware transmit rate limiting queues [WAS: Add support for changing the flow ID of TCP connections]

2014-08-20 Thread K. Macy



  - uint32_t -> m_flowid_t is plain gratuitous.  Now we need to include
mbuf.h in more places just to get this definition.  What's the
advantage of this?  style(9) isn't too fond of typedefs either.  Also,
drivers *do* need to know the width of the flowid.  At least lagg(4)
looks at the high bits of the flowid (see flowid_shift in lagg).  How
high it can go depends on the width of the flowid.


 The flowid should be typedef'ed. Otherwise, how can you know its type when
 passing the flowid along function arguments and so on?



I agree with Navdeep. Its usage should be obvious from context. This just
pollutes the namespace.




 - Interfaces can come and go, routes can change, and so the relationship
between an inpcb and txringid is not stable at all.  What happens when
the outbound route for an inpcb changes?


 This is managed separately by a daemon or such.


No, it's not. Currently, unless you're using flowtables, the route and
llentry are looked up for every single outbound packet. Most users are
lightly loaded enough that they don't see the potential 8x reduction in pps
from the added overhead.




 The problem with the inpcb approach you are suggesting is that it limits
 the rate control feature to traffic which is bound to sockets. Can your
 way of doing rate control be useful to non-socket based firewall
 applications, for example?

 You also assume a 1:1 mapping between inpcb and the flowID, right? What
 about M:N mappings, where multiple streams should share the same flowID,
 because it makes more sense?


That doesn't make any sense to me. FlowIDs are not a limited resource like
8-bit ASIDs, where clever resource management was required. An M:N mapping
would permit arbitrary interleaving of multiple streams, which simply
doesn't seem useful unless there is some critical case where it is a huge
performance win. In general it is adding complexity for a gratuitous
generalization.





 - The in_ratectlreq structure that you propose is inadequate in its
current form.  For example, cxgbe's hardware can do rate limiting on a
per-ring as well as per-connection basis, and it allows for pps,
bandwidth, or min-max limits.  I think this is the critical piece that
we NIC maintainers must agree on before any code hits the core kernel:
how to express a rate-limit policy in a standard way and allow for
hardware assistance opportunistically.  ipfw(4)'s dummynet is probably
interested in this part too, so it's great that Luigi is paying
attention to this thread.


 My in_ratectlreq is a work in progress.


Which means that it probably makes sense to not impinge upon the core
system until it is more refined.



 - The RATECTL ioctls deal with in_ratectlreq so we need to standardize
the ratectlreq structure before these ioctls can be considered generic
ifnet ioctls.  This is the reason cxgbetool (and not ifconfig) has a
private ioctl to frob cxgbe's per-queue rate-limiters.  I did not want
to add ifnet ioctls that in reality were cxgbe only.  Ditto for i2c
ioctls.  Now we have multiple drivers with i2c and melifaro@ is doing
the right thing by promoting these private ioctls to a standard ifnet
ioctl.  Have you considered a private mlxtool as a stop gap measure?


 It might end up that we need to create our own tool for this, with vendor
 specific IOCTLs, if we cannot agree on how to do this in a general way.



 To summarize my take on all of this: we need a standard ratectlreq
 structure,


 Agree.


  a standard way to associate an inpcb with one,


 Maybe.


Associating it with an inpcb doesn't exclude adding a mechanism for
supporting it in firewalls.

-K