Re: [RFC PATCH 1/1] dql: add dql_set_min_limit()

2021-03-09 Thread Dave Taht
I note that "proof" here rests very much on the developer's opinion and
a limited testing base.

Actual operational experience, as in a real deployment, with other applications,
heavy context switching, or virtualization, might yield rather different results.

There are lots of defaults in the linux kernel that are just swags; the
default NAPI and rx/tx ring buffer sizes are two where devs just
copy/paste stuff, which either doesn't scale up or doesn't scale down.

This does not mean I oppose your patch! However, I have a few points I'd
like to make regarding bql and dql in general that I have long wished
were explored.

0) Being an advocate of low latency in general, I have no problem with,
and even prefer, starving the device rather than always keeping it busy.

/me hides

1) BQL is MIAD - multiplicative increase, additive decrease. While in
practice so far this does not seem to matter much (and measuring
things down to the microsecond level is really hard), a stabler
algorithm is AIMD. BQL often absorbs a large TSO burst - usually a
minimum of 128k is observed on gbit, where a stabler state (without
GSO) seemed to be around 40k on many of the chipsets I worked with,
back when I was working in this area.

(cake's gso-splitting also gets lower bql values in general, if you
have enough cpu to run cake)
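To make 1) concrete, the two control laws differ roughly like this.
This is a toy sketch with arbitrary constants, not the actual
lib/dynamic_queue_limits.c logic:

#include <stdbool.h>

/* MIAD, roughly what BQL does today: double the limit when the device
 * was starved, back off by a fixed step when there is slack. */
static unsigned int limit_miad(unsigned int limit, bool starved)
{
	if (starved)
		limit *= 2;		/* multiplicative increase */
	else if (limit > 1024)
		limit -= 1024;		/* additive decrease */
	return limit;
}

/* AIMD, the (arguably) stabler control law: grow by a fixed step on
 * starvation, halve the limit when there is slack. */
static unsigned int limit_aimd(unsigned int limit, bool starved)
{
	if (starved)
		limit += 1024;		/* additive increase */
	else
		limit /= 2;		/* multiplicative decrease */
	return limit;
}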

2) BQL + hardware mq is increasingly an issue in my mind in that if,
say, you are hitting 64 hw queues, each with 128k stored in there, the
buffering is additive, whereas servicing interrupts properly and
keeping the media busy might only require 128k total, spread across
the active queues and flows. I have often thought that making BQL
scale better to multiple hw queues by globally sharing the buffering
state(s) would lead to lower latency, but also that sharing that state
would probably be too high overhead.
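To make the sharing idea in 2) concrete, the naive shape of it would
be a single atomic in-flight byte budget across all tx queues - a
sketch only, and the cache-line bouncing on that one counter is
exactly the overhead I worry about:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_long pool;	/* bytes in flight across all tx queues */

/* Called on xmit: charge the shared budget, or ask the stack to stop
 * this queue until completions drain it. */
static bool shared_bql_try_charge(long bytes, long limit)
{
	if (atomic_load(&pool) + bytes > limit)
		return false;	/* stop queue, retry after tx completion */
	atomic_fetch_add(&pool, bytes);
	return true;
}

/* Called from the tx completion path of any queue. */
static void shared_bql_complete(long bytes)
{
	atomic_fetch_sub(&pool, bytes);
}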

Having not worked out a solution to 2), and preferring to start with
1), and not having a whole lot of support for item 0) in the world, I
just thought I'd mention them here, in the hope someone might give it
a go.


Re: [PATCH net-next v2] selftests: add IPv4 unicast extensions tests

2021-01-26 Thread Dave Taht
> +#
> +# Or 255.255.255/24
> +segmenttest 255.255.255.1 255.255.255.254 24 "assign and ping inside 255.255.255/24 (is allowed)"
> +#
> +# Routing between different networks
> +route_test 240.5.6.7 240.5.6.1  255.1.2.1    255.1.2.3  24 "route between 240.5.6/24 and 255.1.2/24 (is allowed)"
> +route_test 0.200.6.7 0.200.38.1 245.99.101.1 245.99.200.111 16 "route between 0.200/16 and 245.99/16 (is allowed)"
> +#
> +# ==============================================
> +# ===== TESTS THAT CURRENTLY EXPECT FAILURE ====
> +# ==============================================
> +expect_failure=true
> +# It should still not be possible to use 0.0.0.0 or 255.255.255.255
> +# as a unicast address.  Thus, these tests expect failure.
> +segmenttest 0.0.1.5   0.0.0.0 16 "assigning 0.0.0.0 (is forbidden)"
> +segmenttest 255.255.255.1 255.255.255.255 16 "assigning 255.255.255.255 (is forbidden)"
> +#
> +# Test support for not having all of 127 be loopback
> +# Currently Linux does not allow this, so this should fail too
> +segmenttest 127.99.4.5 127.99.4.6 16 "assign and ping inside 127/8 (is forbidden)"
> +#
> +# Test support for lowest host
> +# Currently Linux does not allow this, so this should fail too
> +segmenttest 5.10.15.20 5.10.15.0 24 "assign and ping lowest host (is forbidden)"
> +#
> +# Routing using lowest host as a gateway/endpoint
> +# Currently Linux does not allow this, so this should fail too
> +route_test 192.168.42.1 192.168.42.0 9.8.7.6 9.8.7.0 24 "routing using lowest host (is forbidden)"
> +#
> +# Test support for unicast use of class D
> +# Currently Linux does not allow this, so this should fail too
> +segmenttest 225.1.2.3 225.1.2.200 24 "assign and ping class D address (is forbidden)"
> +#
> +# Routing using class D as a gateway
> +route_test 225.1.42.1 225.1.42.2 9.8.7.6 9.8.7.1 24 "routing using class D (is forbidden)"
> +#
> +# Routing using 127/8
> +# Currently Linux does not allow this, so this should fail too
> +route_test 127.99.2.3 127.99.2.4 200.1.2.3 200.1.2.4 24 "routing using 127/8 (is forbidden)"
> +#
> +unset expect_failure
> +# ==============================================
> +# = END OF TESTS THAT CURRENTLY EXPECT FAILURE =
> +# ==============================================
> +exit ${result}
> --
> 2.25.1
>

Acked-by: Dave Taht 

-- 
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman

d...@taht.net  CTO, TekLibre, LLC Tel: 1-831-435-0729


Re: [RFC v2] current devlink extension plan for NICs

2020-05-10 Thread Dave Taht
On Sun, May 10, 2020 at 7:46 AM Jiri Pirko  wrote:
>
> Hello guys.
>
> Anyone has any opinion on the proposal? Or should I take it as a silent
> agreement? :)
>
> We would like to go ahead and start sending patchsets.

I gotta say that the whole thing makes my head really hurt, and while
this conversation is about how to go about configuring things,
I've been unable to get a grip on how flows will actually behave with
these offloads present.

My overall starting point for thinking about this stuff was described
in this preso to broadcom a few years back:
http://flent-fremont.bufferbloat.net/~d/broadcom_aug9.pdf

More recently I did what I think is my funniest talk ever on these
subjects: 
https://blog.apnic.net/2020/01/22/bufferbloat-may-be-solved-but-its-not-over-yet/

Make some popcorn, take a look. :) I should probably have covered
ecn's (mis)behaviors at the end, but I didn't.

Stephen Hemminger's LCA talk on these subjects was also a riot...

So somehow a mapping from my understanding of how stuff gets
configured to the actual result is needed for me to have any opinion
at all. You have this stuff basically running already? Can you run
various flent.org tests through it?

>
> Thanks!



-- 
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman

d...@taht.net  CTO, TekLibre, LLC Tel: 1-831-435-0729


Re: [PATCH net] wireguard: Use tunnel helpers for decapsulating ECN markings

2020-04-28 Thread Dave Taht
On Tue, Apr 28, 2020 at 2:10 AM Toke Høiland-Jørgensen  wrote:
>
> "Rodney W. Grimes"  writes:
>
> > Replying to a single issue I am reading, and really hope I
> > am miss understanding.  I am neither a wireguard or linux
> > user so I may be miss understanding what is said.
> >
> > Inline at {RWG}
> >
> >> "Jason A. Donenfeld"  writes:
> >>
> >> > Hey Toke,
> >> >
> >> > Thanks for fixing this. I wasn't aware there was a newer ECN RFC. A
> >> > few comments below:
> >> >
> >> > On Mon, Apr 27, 2020 at 8:47 AM Toke Høiland-Jørgensen  wrote:
> >> >> RFC6040 also recommends dropping packets on certain combinations of
> >> >> erroneous code points on the inner and outer packet headers which shouldn't
> >> >> appear in normal operation. The helper signals this by a return value > 1,
> >> >> so also add a handler for this case.
> >> >
> >> > This worries me. In the old implementation, we propagate some outer
> >> > header data to the inner header, which is technically an authenticity
> >> > violation, but minor enough that we let it slide. This patch here
> >> > seems to make that violation a bit worse: namely, we're now changing
> >> > the behavior based on a combination of outer header + inner header. An
> >> > attacker can manipulate the outer header (set it to CE) in order to
> >> > learn whether the inner header was CE or not, based on whether or not
> >> > the packet gets dropped, which is often observable. That's some form
> >
> > Why is anyone dropping on decap over the CE bit?  It should be passed
> > on, not lead to a packet drop.  If the outer header is CE on an inner
> > header of CE it should just continue to be a CE, dropping it is actually
> > breaking the purpose of the CE codepoint, to signal congestion before
> > having to cause a packet loss.
> >
> >> > of an oracle, which I'm not too keen on having in wireguard. On the
> >> > other hand, we pretty much already _explicitly leak this bit_ on tx
> >> > side -- in send.c:
> >> >
> >> > PACKET_CB(skb)->ds = ip_tunnel_ecn_encap(0, ip_hdr(skb), skb); // inner packet
> >> > ...
> >> > wg_socket_send_skb_to_peer(peer, skb, PACKET_CB(skb)->ds); // outer packet
> >> >
> >> > We considered that leak a-okay. But a decryption oracle seems slightly
> >> > worse than an explicit and intentional leak. But maybe not that much
> >> > worse.
> >>
> >> Well, seeing as those two bits on the outer header are already copied
> >> from the inner header, there's no additional leak added by this change,
> >> is there? An in-path observer could set CE and observe that the packet
> >> gets dropped, but all they would learn is that the bits were zero
> >
> > Again why is CE leading to anyone dropping?
> >
> >> (non-ECT). Which they already knew because they could just read the bits
> >> directly from the header.
> >>
> >> Also note, BTW, that another difference between RFC 3168 and 6040 is the
> >> propagation of ECT(1) from outer to inner header. That's not actually
> >> done correctly in Linux ATM, but I sent a separate patch to fix this[0],
> >> which Wireguard will also benefit from with this patch.

I note that there is a large ISP in Argentina that has been
mis-marking most UDP & TCP traffic as CE for years now, and despite
many attempts to get 'em to fix it, when last I checked (2? 3? months
back), they still were doing it.

My impression of the overall competence and veracity of multiple
transit and ISP providers has been sorely tried recently. While I
support treating ECT(1) and ECT(0) properly, I am inclined to start
thinking that CE on a non-ECT encapsulated packet is something that
should not be dropped.

But whatever is decided on that front belongs in the hooks in the
other patch above, not in wireguard, and I'll make the same comment
there.



> >
> > Thanks for this.
> >
> >>
> >> > I wanted to check with you: is the analysis above correct? And can you
> >> > somehow imagine the ==2 case leading to different behavior, in which
> >> > the packet isn't dropped? Or would that ruin the "[de]congestion" part
> >> > of ECN? I just want to make sure I understand the full picture before
> >> > moving in one direction or another.
> >>
> >> So I think the logic here is supposed to be that if there are CE marks
> >> on the outer header, then an AQM somewhere along the path has marked the
> >> packet, which is supposed to be a congestion signal, which we want to
> >> propagate all the way to the receiver (who will then echo it back to the
> >> receiver). However, if the inner packet is non-ECT then we can't
> >> actually propagate the ECN signal; and a drop is thus the only
> >> alternative congestion signal available to us.
> >
> > You cannot get a CE mark on the outer packet if the inner packet is
> > not ECT, as the outer packet would also be not ECT and thus not
> > eligible for CE mark.  If you get the above sited condition something
> > has gone *wrong*.
>
> Yup, you're quite right. If everything is working correctly, this should
> never happen.
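For reference, the RFC 6040 rule being debated condenses to this
little decision table (my own sketch, not the kernel's
INET_ECN_decapsulate helper, and it omits the separate
ECT(1)-propagation fix Toke mentioned):

/* outer/inner are the two ECN bits of each header.
 * Returns: 0 = leave the inner header alone,
 *          1 = propagate CE to the inner header,
 *          2 = drop (the "return value > 1" case the patch handles). */
enum { ECN_NOT_ECT = 0, ECN_ECT_1 = 1, ECN_ECT_0 = 2, ECN_CE = 3 };

static int rfc6040_decap(int outer, int inner)
{
	if (outer == ECN_CE)
		return inner == ECN_NOT_ECT ? 2 : 1;
	return 0;
}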

Re: [PATCH] ath10k: increase rx buffer size to 2048

2020-04-28 Thread Dave Taht
On Tue, Apr 28, 2020 at 5:06 AM Kalle Valo  wrote:
>
> Sven Eckelmann  writes:
>
> > On Wednesday, 1 April 2020 09:00:49 CEST Sven Eckelmann wrote:
> >> On Wednesday, 5 February 2020 20:10:43 CEST Linus Lüssing wrote:
> >> > From: Linus Lüssing 
> >> >
> >> > Before, only frames with a maximum size of 1528 bytes could be
> >> > transmitted between two 802.11s nodes.
> >> >
> >> > For batman-adv for instance, which adds its own header to each frame,
> >> > we typically need an MTU of at least 1532 bytes to be able to transmit
> >> > without fragmentation.
> >> >
> >> > This patch now increases the maximum frame size from 1528 to 1656 bytes.
> >> [...]
> >>
> >> @Kalle, I saw that this patch was marked as deferred [1] but I couldn't 
> >> find
> >> any mail why it was done so. It seems like this currently creates real 
> >> world
> >> problems - so would be nice if you could explain shortly what is currently
> >> blocking its acceptance.
> >
> > Ping?
>
> Sorry for the delay, my plan was to first write some documentation about
> different hardware families but haven't managed to do that yet.
>
> My problem with this patch is that I don't know what hardware and
> firmware versions were tested, so it needs analysis before I feel safe
> to apply it. The ath10k hardware families are very different that even
> if a patch works perfectly on one ath10k hardware it could still break
> badly on another one.
>
> What makes me faster to apply ath10k patches is to have comprehensive
> analysis in the commit log. This shows me the patch author has
> considered about all hardware families, not just the one he is testing
> on, and that I don't need to do the analysis myself.

I have been struggling to get the ath10k to sing and dance using
various variants of the firmware, on this bug over here:

https://forum.openwrt.org/t/aql-and-the-ath10k-is-lovely/

The puzzling thing is the loss of bidirectional throughput at codel target 20,
and getting WAY more (but less than I expected) at codel target 5.

This doesn't quite have bearing on the size of the rx ring, except that in my
experiments the rx ring is rather small!! And yet I get way more performance
out of it.

(still, as you'll see from the bug, it's WAY better than it used to be)

Is NAPI in this driver? I'm afraid to look.
> --
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches



-- 
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729


Re: [PATCH v4] net: sch_generic: Use pfifo_fast as fallback scheduler for CAN hardware

2019-10-23 Thread Dave Taht
On Wed, Oct 23, 2019 at 3:52 AM Vincent Prince
 wrote:
>
> There is networking hardware that isn't based on Ethernet for layers 1 and 2.
>
> For example CAN.
>
> CAN is a multi-master serial bus standard for connecting Electronic Control
> Units [ECUs] also known as nodes. A frame on the CAN bus carries up to 8 bytes
> of payload. Frame corruption is detected by a CRC. However frame loss due to
> corruption is possible, but a quite unusual phenomenon.
>
> While fq_codel works great for TCP/IP, it doesn't for CAN. There are a lot of
> legacy protocols on top of CAN, which are not built with flow control or high
> CAN frame drop rates in mind.
>
> When using fq_codel, as soon as the queue reaches a certain delay based length,
> skbs from the head of the queue are silently dropped. Silently meaning that the
> user space using a send() or similar syscall doesn't get an error. However
> TCP's flow control algorithm will detect dropped packets and adjust the
> bandwidth accordingly.
>
> When using fq_codel and sending raw frames over CAN, which is the common use
> case, the user space thinks the packet has been sent without problems, because
> send() returned without an error. pfifo_fast will drop skbs if the queue
> length exceeds the maximum. But with this scheduler the skbs at the tail are
> dropped, and an error (-ENOBUFS) is propagated to user space, so that the user
> space can slow down the packet generation.
>
> On distributions where fq_codel is made default via CONFIG_DEFAULT_NET_SCH
> during compile time, or set default during runtime with sysctl
> net.core.default_qdisc (see [1]), we get a bad user experience. In my test case
> with pfifo_fast, I can transfer thousands of millions of CAN frames without a
> frame drop. On the other hand, with fq_codel there is more than one lost CAN
> frame per thousand frames.
>
> As pointed out fq_codel is not suited for CAN hardware, so this patch changes
> attach_one_default_qdisc() to use pfifo_fast for "ARPHRD_CAN" network devices.
>
> During transition of a netdev from down to up state the default queuing
> discipline is attached by attach_default_qdiscs() with the help of
> attach_one_default_qdisc(). This patch modifies attach_one_default_qdisc() to
> attach the pfifo_fast (pfifo_fast_ops) if the network device type is
> "ARPHRD_CAN".
>
> [1] https://github.com/systemd/systemd/issues/9194
>
> Suggested-by: Marc Kleine-Budde 
> Signed-off-by: Marc Kleine-Budde 
> Signed-off-by: Vincent Prince 
> ---
> Changes in v4:
>  - add Marc credit to commit log
>
> Changes in v3:
>  - add description
>
> Changes in v2:
>  - reformat patch
>
>  net/sched/sch_generic.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 77b289d..dfb2982 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1008,6 +1008,8 @@ static void attach_one_default_qdisc(struct net_device *dev,
>
> if (dev->priv_flags & IFF_NO_QUEUE)
> ops = &noqueue_qdisc_ops;
> +   else if(dev->type == ARPHRD_CAN)
> +   ops = &pfifo_fast_ops;
>
> qdisc = qdisc_create_dflt(dev_queue, ops, TC_H_ROOT, NULL);
> if (!qdisc) {
> --
> 2.7.4
>

While I'm delighted to see such a simple patch emerge, openwrt long
ago patched out pfifo_fast. pfifo_fast has additional semantics not
needed in the CAN use case either (I think), and plain "pfifo" is
fine, but sure, pfifo_fast if you must.

Anyway, regardless, that's an easy fix, and I hope it goes to stable,
as I've had nightmares about cars exploding due to out-of-order CAN
bus operations ever since I learned of this bug.
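For anyone wiring this up in an application: the point of the
tail-drop + -ENOBUFS behavior is that the sender can see and react to
a full queue. A minimal sketch (mine, not from the patch) of what that
looks like on a raw CAN socket:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/can.h>

/* Returns 0 on success, 1 if the caller backed off and should retry,
 * -1 on a hard error. */
static int send_can_frame(int s, const struct can_frame *cf)
{
	if (send(s, cf, sizeof(*cf), 0) < 0) {
		if (errno == ENOBUFS) {
			/* pfifo_fast queue is full: slow down instead of
			 * silently losing frames */
			usleep(1000);
			return 1;
		}
		perror("send");
		return -1;
	}
	return 0;
}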

Acked-by: Dave Taht 

-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


Re: [PATCH iproute2-next] Add tc-BPF example for a TCP pure ack recognizer

2019-10-22 Thread Dave Taht
On Mon, Oct 21, 2019 at 4:48 PM Stephen Hemminger
 wrote:
>
> On Tue,  1 May 2018 11:32:41 -0700
> Dave Taht  wrote:
>
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright 2017 Google Inc.
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> > + * 02110-1301, USA.
> > + */
> > +
>
> SPDX is enough don't add boilerplate.

I had posted this example code 2+ years ago, and, um, this was rather
a long time to await code review.

Do y'all actually want this?

-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


Re: dsa traffic priorization

2019-09-18 Thread Dave Taht
On Wed, Sep 18, 2019 at 7:37 AM Vladimir Oltean  wrote:
>
> Hi Sascha,
>
> On Wed, 18 Sep 2019 at 17:03, Sascha Hauer  wrote:
> >
> > Hi All,
> >
> > We have a customer using a Marvell 88e6240 switch with Ethercat on one port 
> > and
> > regular network traffic on another port. The customer wants to configure 
> > two things
> > on the switch: First Ethercat traffic shall be priorized over other network 
> > traffic
> > (effectively prioritizing traffic based on port). Second the ethernet 
> > controller
> > in the CPU is not able to handle full bandwidth traffic, so the traffic to 
> > the CPU
> > port shall be rate limited.
> >
>
> You probably already know this, but egress shaping will not drop
> frames, just let them accumulate in the egress queue until something
> else happens (e.g. queue occupancy threshold triggers pause frames, or
> tail dropping is enabled, etc). Is this what you want? It sounds a bit

Dropping in general is a basic attribute of the fq_codel algorithm,
which is enabled by default on many boxes. It's latency sensitive, so
it responds well to pause frame (over)use.

Usually the cpu-to-switch port is exposed via a vlan (e.g. eth0:2),
and you can shape both inbound and outbound on that, using htb/hfsc +
fq_codel, or cake.

But, also, most usually what happens when the cpu cannot keep up with
the switch is that we drop packets on the rx ring for receive, and in
fq_codel on send.

> strange to me to configure egress shaping on the CPU port of a DSA
> switch. That literally means you are buffering frames inside the
> system. What about ingress policing?
>
> > For reference the patch below configures the switch to their needs. Now the 
> > question
> > is how this can be implemented in a way suitable for mainline. It looks 
> > like the per
> > port priority mapping for VLAN tagged packets could be done via ip link add 
> > link ...
> > ingress-qos-map QOS-MAP. How the default priority would be set is unclear 
> > to me.
> >
>
> Technically, configuring a match-all rxnfc rule with ethtool would
> count as 'default priority' - I have proposed that before. Now I'm not
> entirely sure how intuitive it is, but I'm also interested in being
> able to configure this.
>
> > The other part of the problem seems to be that the CPU port has no network 
> > device
> > representation in Linux, so there's no interface to configure the egress 
> > limits via tc.
> > This has been discussed before, but it seems there hasn't been any 
> > consensous regarding how
> > we want to proceed?
> >
> > Sascha
> >
> > -8<---
> >
> >  drivers/net/dsa/mv88e6xxx/chip.c | 54 +++-
> >  drivers/net/dsa/mv88e6xxx/port.c | 87 
> >  drivers/net/dsa/mv88e6xxx/port.h | 19 +++
> >  3 files changed, 159 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> > index d0a97eb73a37..2a15cf259d04 100644
> > --- a/drivers/net/dsa/mv88e6xxx/chip.c
> > +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> > @@ -2090,7 +2090,9 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
> >  {
> >  struct dsa_switch *ds = chip->ds;
> >  int err;
> > +u16 addr;
> >  u16 reg;
> > +u16 val;
> >
> >  chip->ports[port].chip = chip;
> >  chip->ports[port].port = port;
> > @@ -2246,7 +2248,57 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
> >  /* Default VLAN ID and priority: don't set a default VLAN
> >   * ID, and set the default packet priority to zero.
> >   */
> > -return mv88e6xxx_port_write(chip, port, MV88E6XXX_PORT_DEFAULT_VLAN, 0);
> > +err = mv88e6xxx_port_write(chip, port, MV88E6XXX_PORT_DEFAULT_VLAN, 0);
> > +if (err)
> > +return err;
> > +
> > +#define SWITCH_CPU_PORT 5
> > +#define SWITCH_ETHERCAT_PORT 3
> > +
> > +/* set the egress rate */
> > +switch (port) {
> > +case SWITCH_CPU_PORT:
> > +err = mv88e6xxx_port_set_egress_rate(chip, port,
> > +MV88E6XXX_PORT_EGRESS_RATE_CTL2_COUNT_MODE_FRAME, 3);
> > +break;
> > +default:
> > +err = mv88e6xxx_port_set_egress_rate(chip, port,
> > +MV88E6XXX_PORT_EGRESS_RATE_CTL2_COUNT_MODE_FRAME, 0);
> > +break;
> > +}
> > +
> > +if (err)
> > +return err;
> > +
> > +/* set the output queue usage */
> > +switch (port) {
> > +case SWITCH_CPU_PORT:
> > +err = mv88e6xxx_port_set_output_queue_schedule(chip, port,
> > +MV88E6XXX_PORT_EGRESS_RATE_CTL2_SCHEDULE_Q3_STRICT);
> > +

Re: "[RFC PATCH net-next 2/2] Reduce localhost to 127.0.0.0/16"

2019-09-13 Thread Dave Taht
On Fri, Sep 13, 2019 at 9:54 AM Mark Smith  wrote:
>
> (Not subscribed to the ML)
>
> Hi,
>
> I've noticed this patch. I don't think it should be applied, as it
> contradicts RFC 1122, "Requirements for Internet Hosts --
> Communication Layers":

Yea!  I kicked off a discussion!

> "(g)  { 127,  }
>
>  Internal host loopback address.  Addresses of this form
>  MUST NOT appear outside a host."

That 1984 (89) definition of a "host" has been stretched considerably
in the past few decades. We now have hypervisors, multiple cores,
multiple VMs, VMs stacked within VMs, and containers with virtual
interfaces on them, and a confusing plethora of rfc1918 addressing and
NAT between them and the wire.

This RFC-to-netdev's proposed reduction to a /16 was sufficient to
cover the two main use cases for loopback in Linux:
127.0.0.1 - loopback
127.0.1.1 - dns

We've also seen some usage of things like 127.0.0.53 and so on, and in
the discussion at linuxconf last week it came out that Cumulus and a
few others were possibly using high values of 127.x for switch chassis
addressing, but we haven't got any documentation on how that works yet.

The 1995 IPv6 standard and its successors have only one loopback
address. 127.0.0.0/8 is 16m addresses wasted internal to the host.

> RFC 1122 is one of the relatively few Internet Standards, specifically
> Standard Number 3:
>
> https://www.rfc-editor.org/standards

We have been exploring the solution space here:

https://github.com/dtaht/unicast-extensions/blob/master/rfcs/draft-gilmore-taht-v4uniext.txt

If you would like to file more comments and bugs - or discuss here!
that would be great.

>
> Regards,
> Mark.



-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


Re: [PATCH v3 2/2] tcp: Add rcv_wnd to TCP_INFO

2019-09-12 Thread Dave Taht
On Thu, Sep 12, 2019 at 1:59 AM Neal Cardwell  wrote:
>
> On Wed, Sep 11, 2019 at 6:32 PM Thomas Higdon  wrote:
> >
> > Neal Cardwell mentioned that rcv_wnd would be useful for helping
> > diagnose whether a flow is receive-window-limited at a given instant.
> >
> > This serves the purpose of adding an additional __u32 to avoid the
> > would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
> >
> > Signed-off-by: Thomas Higdon 
> > ---
>
> Thanks, Thomas.
>
> I know that when I mentioned this before I mentioned the idea of both
> tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side
> receive window) in tcp_info, and did not express a preference between
> the two. Now that we are faced with a decision between the two,
> personally I think it would be a little more useful to start with
> tp->snd_wnd. :-)
>
> Two main reasons:
>
> (1) Usually when we're diagnosing TCP performance problems, we do so
> from the sender, since the sender makes most of the
> performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> From the sender-side the thing that would be most useful is to see
> tp->snd_wnd, the receive window that the receiver has advertised to
> the sender.

I am under the impression that, particularly in the mobile space,
network behavior is often governed by rcv_wnd. At least, there have
been so many papers on this that I'd tended to assume so.

Given a desire to do both vars, is there a *third* u32 we could add to
fill in the next hole? :)
ECN marks?
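(A sketch of how a sender-side diagnostic would read both windows,
assuming the tcpi_snd_wnd/tcpi_rcv_wnd field names proposed in this
thread - mine, not code from the patch:)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

/* Dump both advertised windows for a connected TCP socket. */
static void dump_windows(int sock)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	if (getsockopt(sock, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
		printf("snd_wnd %u rcv_wnd %u\n",
		       ti.tcpi_snd_wnd, ti.tcpi_rcv_wnd);
}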

>
> (2) From the receiver side, "ss" can already show a fair amount of
> info about receive-side buffer/window limits, like:
> info->tcpi_rcv_ssthresh, info->tcpi_rcv_space,
> skmeminfo[SK_MEMINFO_RMEM_ALLOC], skmeminfo[SK_MEMINFO_RCVBUF]. Often
> the rwin can be approximated by combining those.
>
> Hopefully Eric, Yuchung, and Soheil can weigh in on the question of
> snd_wnd vs rcv_wnd. Or we can perhaps think of another field, and add
> the tcpi_rcvi_ooopack, snd_wnd, rcv_wnd, and that final field, all
> together.
>
> thanks,
> neal



-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


[RFC PATCH net-next 1/2] Allow 225/8-231/8 as unicast

2019-09-09 Thread Dave Taht
This patch converts the long "reserved for future use" multicast
address space, 225/8-231/8 - 120m addresses - for use as unicast
addresses instead.

In a comprehensive survey of all the open source code on the planet
we found no users of this range. We found some official and unofficial
usage of addresses in 224/8 and in 239/8 - both spaces at well under
50% allocation in the first place, so we anticipate no additional growth
for any reason, into the 225-231 spaces.

There will be some short term incompatibilities induced.

The principal flaw of converting this space to unicast involves
a non-uniext box sending a packet to a formerly multicast address,
and the reply coming back from that "formerly multicast" address
as unicast.

The return packet will be dropped, because the reply arrives from a unicast
(L2) source for what the non-uniext box still considers a multicast (L3) address.

And, like all multicast packets sent anywhere, the attempt will still
flood all ports on the local switch.

A TCP attempt fails immediately due to the inherent IN_MULTICAST
check in the existing kernel. Some stacks (not Linux) MAY do more
of the wrong thing here.

As for userspace exposure...

We were only able to find 89 packages in fedora that used the IN_MULTICAST
macro. Currently the plan is not to kill IN_MULTICAST (as doing it right
requires access to the big endian macros), but to retire its usages in
the kernel (already done) and then in the very few programs that use it
in userspace.

All the routing daemons we've inspected and modified don't use IN_MULTICAST.
The patches to them are trivial.

New users of multicast seem to always pick something out of the 224/8
or 239/8 ranges, which are untouched by this patch.

Additional potential problems include: 

* hardware offloads that explicitly check for multicast
* binary firmware that explicitly checks for multicast
* a tiny cpu hit

Whether or not these problems are worth addressing to regain 120m
useful unicast addresses in the next decade is up for debate.

---
 include/linux/in.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1873ef642605..8665842a3589 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -42,7 +42,10 @@ static inline bool ipv4_is_loopback(__be32 addr)
 
 static inline bool ipv4_is_multicast(__be32 addr)
 {
-	return (addr & htonl(0xf0000000)) == htonl(0xe0000000);
+	if ((addr & htonl(0xf0000000)) == htonl(0xe0000000))
+		return !((htonl(addr) >= 0xe1000000) &&
+			 (htonl(addr) < 0xe8000000));
+	return 0;
 }
 
 static inline bool ipv4_is_local_multicast(__be32 addr)
-- 
2.17.1
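To make the new semantics concrete, here is a small userspace harness
(my own, not part of the patch) that mirrors the new check:

#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same logic as the patched ipv4_is_multicast(), in userspace. */
static bool is_multicast(uint32_t be_addr)
{
	if ((be_addr & htonl(0xf0000000)) == htonl(0xe0000000))
		return !(ntohl(be_addr) >= 0xe1000000 &&
			 ntohl(be_addr) < 0xe8000000);
	return false;
}

int main(void)
{
	const char *addrs[] = { "224.0.0.1", "225.1.2.3",
				"231.255.255.255", "239.1.1.1" };
	for (int i = 0; i < 4; i++) {
		struct in_addr a;
		inet_pton(AF_INET, addrs[i], &a);	/* known-good strings */
		printf("%-16s %s\n", addrs[i],
		       is_multicast(a.s_addr) ? "multicast" : "unicast");
	}
	return 0;
}

This prints 224.0.0.1 and 239.1.1.1 as multicast, and the 225-231
samples as unicast, exactly the carve-out described above.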



[RFC PATCH net-next 2/2] Reduce localhost to 127.0.0.0/16

2019-09-09 Thread Dave Taht
The 127 concept of "localhost" was basically a straight port over from
original ARPANET behavior. At the time it was envisioned that
many services would exist in the mainframe that would need to be
individually addressable; this long predated alternative approaches
such as tipc, etc.

This reduces 127.0.0.0/8 to a /16, and opens up 127.1.0.0 and above
for use as real, unicast addresses either on the wire or internal
to containers, virtual machines, vpns, etc.

The only major uses of 127 to date have been (of
course) 127.0.0.1 and 127.0.1.1 (for dns), and various ad hoc uses
for things like anti-spam daemons.

Linux also has a non-standard treatment of the entire 127 address
space - anything sent to any address there "just works". This patch
preserves that behavior for 127.0.0.0/16 only.

Other uses, such as ntp's 127.127 "handle", don't actually
touch the stack.  We've seen 127.x used for chassis access and a few
other explicit uses in the field, but very little.

There is some small hope that we can preserve the original intent of
"localhost" in an age of stacked VMs and containers, where, instead of
using rfc1918 NAT at each level, we can route.


---
 include/linux/in.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 8665842a3589..6e4f37e5bf51 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -37,7 +37,9 @@ static inline int proto_ports_offset(int proto)
 
 static inline bool ipv4_is_loopback(__be32 addr)
 {
-	return (addr & htonl(0xff000000)) == htonl(0x7f000000);
+	if ((addr & htonl(0xff000000)) == htonl(0x7f000000))
+		return (addr & htonl(0x00ff0000)) == 0;
+	return 0;
 }
 
 static inline bool ipv4_is_multicast(__be32 addr)
-- 
2.17.1



[RFC PATCH net-next 0/2] more IPv4 unicast extensions

2019-09-09 Thread Dave Taht
Adding support in linux for the 240/4 and 0/8 ipv4 address ranges was
easy: no-brainer removals of obsolete support for obsolete
specifications. Has anyone noticed yet?

The following two patches are intended as discussion points for
my https://linuxplumbersconf.org/event/4/contributions/457/
talk today.

Dave Taht (2):
  Allow 225/8-231/8 as unicast
  Reduce localhost to a /16

 include/linux/in.h | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

-- 
2.17.1



Re: how to search for the best route from userspace in netlink?

2019-09-03 Thread Dave Taht
On Mon, Sep 2, 2019 at 8:13 PM David Ahern  wrote:
>
> On 9/2/19 4:07 PM, Dave Taht wrote:
> > Windows has the "RtmGetMostSpecificDestination" call:
> > https://docs.microsoft.com/en-us/windows/win32/rras/search-for-the-best-route
> >
> > In particular, I wanted to search for the best route, AND pick up the
> > PMTU from that (if it existed)
> > for older UDP applications like dnssec[1] and newer ones like QUIC[2].
>
> RTM_GETROUTE with data for the route lookup. See iproute2 code as an
> example.

Yes. I really didn't describe my thinking very well. It's about coping
with PMTU better in an increasingly UDP'd and tunneled internet. TCP
(being kernel based) will do the probe and cache that attribute of the
path; UDP does not. A UDP-based app with root privs could be setting
it after figuring it out; an unprivileged one cannot.

For more detail from server-al sides of that philosophical debate,
please see the links I posted originally.
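For anyone who lands here from a search, a rough, untested sketch
(mine) of the RTM_GETROUTE lookup David points at, reading back
RTAX_MTU from the reply's RTA_METRICS when the kernel has a cached
path mtu for the destination (8.8.8.8 here is just a placeholder):

#include <arpa/inet.h>
#include <linux/rtnetlink.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	struct { struct nlmsghdr nh; struct rtmsg rt; char attrs[64]; } req;
	struct in_addr dst;
	struct rtattr *rta;
	char buf[8192];
	ssize_t len;

	memset(&req, 0, sizeof(req));
	inet_pton(AF_INET, "8.8.8.8", &dst);
	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(req.rt));
	req.nh.nlmsg_type = RTM_GETROUTE;
	req.nh.nlmsg_flags = NLM_F_REQUEST;
	req.rt.rtm_family = AF_INET;
	req.rt.rtm_dst_len = 32;

	/* append RTA_DST: the address we want the best route for */
	rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
	rta->rta_type = RTA_DST;
	rta->rta_len = RTA_LENGTH(sizeof(dst));
	memcpy(RTA_DATA(rta), &dst, sizeof(dst));
	req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + rta->rta_len;

	send(fd, &req, req.nh.nlmsg_len, 0);
	len = recv(fd, buf, sizeof(buf), 0);

	for (struct nlmsghdr *nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
	     nh = NLMSG_NEXT(nh, len)) {
		if (nh->nlmsg_type != RTM_NEWROUTE)
			continue;
		struct rtmsg *rm = NLMSG_DATA(nh);
		int alen = RTM_PAYLOAD(nh);
		for (struct rtattr *a = RTM_RTA(rm); RTA_OK(a, alen);
		     a = RTA_NEXT(a, alen)) {
			if (a->rta_type != RTA_METRICS)
				continue;
			/* nested metrics; RTAX_MTU is the cached pmtu */
			int mlen = RTA_PAYLOAD(a);
			for (struct rtattr *m = RTA_DATA(a); RTA_OK(m, mlen);
			     m = RTA_NEXT(m, mlen))
				if (m->rta_type == RTAX_MTU)
					printf("path mtu: %u\n",
					       *(unsigned int *)RTA_DATA(m));
		}
	}
	close(fd);
	return 0;
}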



-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


[PATCH net-next] Convert usage of IN_MULTICAST to ipv4_is_multicast

2019-09-02 Thread Dave Taht
IN_MULTICAST's primary intent is as a uapi macro.

Elsewhere in the kernel we use ipv4_is_multicast consistently.

This patch unifies linux's multicast checks to use that function
rather than this macro.

Signed-off-by: Dave Taht 
Reviewed-by: Toke Høiland-Jørgensen 

---
 drivers/net/geneve.c | 2 +-
 include/net/vxlan.h  | 4 ++--
 net/rds/af_rds.c | 4 ++--
 net/rds/bind.c   | 4 ++--
 net/rds/send.c   | 4 ++--
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index cb2ea8facd8d..3ab24fdccd3b 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1345,7 +1345,7 @@ static int geneve_nl2info(struct nlattr *tb[], struct nlattr *data[],
info->key.u.ipv4.dst =
nla_get_in_addr(data[IFLA_GENEVE_REMOTE]);
 
-   if (IN_MULTICAST(ntohl(info->key.u.ipv4.dst))) {
+   if (ipv4_is_multicast(info->key.u.ipv4.dst)) {
NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_REMOTE],
"Remote IPv4 address cannot be Multicast");
return -EINVAL;
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index dc1583a1fb8a..335283dbe9b3 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -391,7 +391,7 @@ static inline bool vxlan_addr_multicast(const union vxlan_addr *ipa)
if (ipa->sa.sa_family == AF_INET6)
return ipv6_addr_is_multicast(&ipa->sin6.sin6_addr);
else
-   return IN_MULTICAST(ntohl(ipa->sin.sin_addr.s_addr));
+   return ipv4_is_multicast(ipa->sin.sin_addr.s_addr);
 }
 
 #else /* !IS_ENABLED(CONFIG_IPV6) */
@@ -403,7 +403,7 @@ static inline bool vxlan_addr_any(const union vxlan_addr *ipa)
 
 static inline bool vxlan_addr_multicast(const union vxlan_addr *ipa)
 {
-   return IN_MULTICAST(ntohl(ipa->sin.sin_addr.s_addr));
+   return ipv4_is_multicast(ipa->sin.sin_addr.s_addr);
 }
 
 #endif /* IS_ENABLED(CONFIG_IPV6) */
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 2977137c28eb..1a5bf3fa4578 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -559,7 +559,7 @@ static int rds_connect(struct socket *sock, struct sockaddr *uaddr,
ret = -EDESTADDRREQ;
break;
}
-   if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr)) ||
+   if (ipv4_is_multicast(sin->sin_addr.s_addr) ||
sin->sin_addr.s_addr == htonl(INADDR_BROADCAST)) {
ret = -EINVAL;
break;
@@ -593,7 +593,7 @@ static int rds_connect(struct socket *sock, struct sockaddr *uaddr,
addr4 = sin6->sin6_addr.s6_addr32[3];
if (addr4 == htonl(INADDR_ANY) ||
addr4 == htonl(INADDR_BROADCAST) ||
-   IN_MULTICAST(ntohl(addr4))) {
+   ipv4_is_multicast(addr4)) {
ret = -EPROTOTYPE;
break;
}
diff --git a/net/rds/bind.c b/net/rds/bind.c
index 0f4398e7f2a7..6dbb763bc1fd 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -181,7 +181,7 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
if (addr_len < sizeof(struct sockaddr_in) ||
sin->sin_addr.s_addr == htonl(INADDR_ANY) ||
sin->sin_addr.s_addr == htonl(INADDR_BROADCAST) ||
-   IN_MULTICAST(ntohl(sin->sin_addr.s_addr)))
+   ipv4_is_multicast(sin->sin_addr.s_addr))
return -EINVAL;
ipv6_addr_set_v4mapped(sin->sin_addr.s_addr, &v6addr);
binding_addr = &v6addr;
@@ -206,7 +206,7 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
addr4 = sin6->sin6_addr.s6_addr32[3];
if (addr4 == htonl(INADDR_ANY) ||
addr4 == htonl(INADDR_BROADCAST) ||
-   IN_MULTICAST(ntohl(addr4)))
+   ipv4_is_multicast(addr4))
return -EINVAL;
}
/* The scope ID must be specified for link local address. */
diff --git a/net/rds/send.c b/net/rds/send.c
index 9ce552abf9e9..82dcd8b84fe7 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1144,7 +1144,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
case AF_INET:
if (usin->sin_addr.s_addr == htonl(INADDR_ANY) ||
usin->sin_addr.s_addr == htonl(INADDR_BROADCAST) ||
-   IN_MULTICAST(ntohl(usin->sin_addr.s_addr))) {
+   ipv4_is_multicast(usin->sin

how to search for the best route from userspace in netlink?

2019-09-02 Thread Dave Taht
Windows has the "RtmGetMostSpecificDestination" call:
https://docs.microsoft.com/en-us/windows/win32/rras/search-for-the-best-route

In particular, I wanted to search for the best route, AND pick up the
PMTU from that (if it existed)
for older UDP applications like dnssec[1] and newer ones like QUIC[2].
The alternatives, which
basically include probing perpetually and/or picking really low
values, seem increasingly less than
optimal.

Put in a wrapper around bpf[3]'s lpm calls? Create a new netlink message?

[1] https://github.com/dns-violations/dnsflagday/issues/125
[2] https://tools.ietf.org/html/draft-ietf-quic-transport-22#section-14.1
[3] https://github.com/torvalds/linux/blob/master/kernel/bpf/lpm_trie.c#L164

-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


Re: Unable to create htb tc classes more than 64K

2019-08-27 Thread Dave Taht
On Sun, Aug 25, 2019 at 11:47 PM Eric Dumazet  wrote:
>
>
>
> On 8/25/19 7:52 PM, Cong Wang wrote:
> > On Wed, Aug 21, 2019 at 11:00 PM Akshat Kakkar  
> > wrote:
> >>
> >> On Thu, Aug 22, 2019 at 3:37 AM Cong Wang  wrote:
>  I am using ipset +  iptables to classify and not filters. Besides, if
>  tc is allowing me to define qdisc -> classes -> qdsic -> classes
>  (1,2,3 ...) sort of structure (ie like the one shown in ascii tree)
>  then how can those lowest child classes be actually used or consumed?
> >>>
> >>> Just install tc filters on the lower level too.
> >>
> >> If I understand correctly, you are saying,
> >> instead of :
> >> tc filter add dev eno2 parent 100: protocol ip prio 1 handle
> >> 0x0001 fw flowid 1:10
> >> tc filter add dev eno2 parent 100: protocol ip prio 1 handle
> >> 0x0002 fw flowid 1:20
> >> tc filter add dev eno2 parent 100: protocol ip prio 1 handle
> >> 0x0003 fw flowid 2:10
> >> tc filter add dev eno2 parent 100: protocol ip prio 1 handle
> >> 0x0004 fw flowid 2:20
> >>
> >>
> >> I should do this: (i.e. changing parent to just immediate qdisc)
> >> tc filter add dev eno2 parent 1: protocol ip prio 1 handle 0x0001
> >> fw flowid 1:10
> >> tc filter add dev eno2 parent 1: protocol ip prio 1 handle 0x0002
> >> fw flowid 1:20
> >> tc filter add dev eno2 parent 2: protocol ip prio 1 handle 0x0003
> >> fw flowid 2:10
> >> tc filter add dev eno2 parent 2: protocol ip prio 1 handle 0x0004
> >> fw flowid 2:20
> >
> >
> > Yes, this is what I meant.
> >
> >
> >>
> >> I tried this previously. But there is not change in the result.
> >> Behaviour is exactly same, i.e. I am still getting 100Mbps and not
> >> 100kbps or 300kbps
> >>
> >> Besides, as I mentioned previously I am using ipset + skbprio and not
> >> filters stuff. Filters I used just to test.
> >>
> >> ipset  -N foo hash:ip,mark skbinfo
> >>
> >> ipset -A foo 10.10.10.10, 0x0001 skbprio 1:10
> >> ipset -A foo 10.10.10.20, 0x0002 skbprio 1:20
> >> ipset -A foo 10.10.10.30, 0x0003 skbprio 2:10
> >> ipset -A foo 10.10.10.40, 0x0004 skbprio 2:20
> >>
> >> iptables -A POSTROUTING -j SET --map-set foo dst,dst --map-prio
> >
> > Hmm..
> >
> > I am not familiar with ipset, but it seems to save the skbprio into
> > skb->priority, so it doesn't need TC filter to classify it again.
> >
> > I guess your packets might go to the direct queue of HTB, which
> > bypasses the token bucket. Can you dump the stats and check?
>
> With more than 64K 'classes' I suggest to use a single FQ qdisc [1], and
> an eBPF program using EDT model (Earliest Departure Time)

Although this is very cool, I think in this case the OP is acting as a
router, not a server?

> The BPF program would perform the classification, then find a data structure
> based on the 'class', and then update/maintain class virtual times and 
> skb->tstamp
>
> TBF = bpf_map_lookup_elem(&map, &classid);
>
> uint64_t now = bpf_ktime_get_ns();
> uint64_t time_to_send = max(TBF->time_to_send, now);
>
> time_to_send += (u64)qdisc_pkt_len(skb) * NSEC_PER_SEC / TBF->rate;
> if (time_to_send > TBF->max_horizon) {
> return TC_ACT_SHOT;
> }
> TBF->time_to_send = time_to_send;
> skb->tstamp = max(time_to_send, skb->tstamp);
> if (time_to_send - now > TBF->ecn_horizon)
> bpf_skb_ecn_set_ce(skb);
> return TC_ACT_OK;
>
> tools/testing/selftests/bpf/progs/test_tc_edt.c shows something similar.
>
>
> [1]  MQ + FQ if the device is multi-queues.
>
>Note that this setup scales very well on SMP, since we no longer are forced
>  to use a single HTB hierarchy (protected by a single spinlock)
>
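For the archives: a compilable rendering of the EDT sketch above might
look roughly like this (my sketch; it assumes the classifier has
already stored a classid in skb->mark and that rates are in
bytes/sec):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct tbf_state {
	__u64 time_to_send;	/* per-class virtual clock, ns */
	__u64 rate;		/* bytes per second */
	__u64 max_horizon;	/* ns of queue tolerated before drop */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1 << 17);	/* far more than 64K classes */
	__type(key, __u32);
	__type(value, struct tbf_state);
} classes SEC(".maps");

SEC("tc")
int edt_shaper(struct __sk_buff *skb)
{
	__u32 classid = skb->mark;	/* assumed set by the classifier */
	struct tbf_state *tbf = bpf_map_lookup_elem(&classes, &classid);
	__u64 now, tts;

	if (!tbf || !tbf->rate)
		return TC_ACT_OK;

	now = bpf_ktime_get_ns();
	tts = tbf->time_to_send > now ? tbf->time_to_send : now;
	tts += (__u64)skb->len * 1000000000ULL / tbf->rate;
	if (tts - now > tbf->max_horizon)
		return TC_ACT_SHOT;

	tbf->time_to_send = tts;
	if (skb->tstamp < tts)
		skb->tstamp = tts;	/* FQ releases it at this time */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";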


-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


Re: [net-next] net: sched: pie: enable timestamp based delay calculation

2019-08-27 Thread Dave Taht
On Tue, Aug 27, 2019 at 8:34 AM Eric Dumazet  wrote:
>
>
>
> On 8/27/19 4:19 PM, Gautam Ramakrishnan wrote:
> > RFC 8033 suggests an alternative approach to calculate the queue
> > delay in PIE by using per packet timestamps. This patch enables the
> > PIE implementation to do this.
> >
> > The calculation of queue delay is as follows:
> >   qdelay = now - packet_enqueue_time
> >
> > To enable the use of timestamps:
> >   modprobe sch_pie use_timestamps=1
>
>
> No module parameter is accepted these days.
>
> Please add a new attribute instead,
> so that pie can be used in both mode on the same host.

I note that I think (but lack independent data) this improvement to
the rate estimator in pie should improve its usability and accuracy
in low-rate or hw mq situations. With some benchmarking to
show the cpu impact (at high and low rates) of this improvement, as
well as the network impact, the old way should probably be dropped and
the new way adopted without needing a new variable to control it.

A commit showing the before/after cpu and network impact with a whole
bunch of flent benchmarks would be great.

(I'd also love to know if pie can be run with a lower target - like 5ms -
 with this mod in place)
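(Schematically, the two estimators under discussion - the names here
are mine, not sch_pie.c's:)

typedef unsigned long long u64;
#define NSEC_PER_SEC 1000000000ULL

/* Classic PIE: infer delay from backlog and a measured drain rate;
 * noisy at low rates, where few departures update the estimate. */
static u64 qdelay_rate_based(u64 backlog_bytes, u64 avg_drain_rate)
{
	return avg_drain_rate ?
	       backlog_bytes * NSEC_PER_SEC / avg_drain_rate : 0;
}

/* RFC 8033 alternative: measure the delay directly per packet. */
static u64 qdelay_timestamp(u64 now_ns, u64 enqueue_ns)
{
	return now_ns - enqueue_ns;
}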

>
> For a typical example of attribute addition, please take
> a look at commit 48872c11b77271ef9b070bdc50afe6655c4eb9aa
> ("net_sched: sch_fq: add dctcp-like marking")

utilizing ce_threshold in this way is actually independent of whether or
not you are using dctcp-style or rfc3168 style ecn marking.

It's "I'm self congesting... Dooo something! Anything, to reduce the load!"




-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


Re: [PATCH net-next v5] sched: Add dualpi2 qdisc

2019-08-22 Thread Dave Taht
This is vastly improved code, thank you!

1) Since we're still duking it out over the meaning of the bits - not
just the SCE thing, but as best as I can
tell (but could be wrong) the NQB idea wants to put something into the
l4s fast queue? Or is NQB supposed to
be a third queue?

In those cases, the ecn_mask should just be mask.

2) Is the intent to make the drop probability 0 by default? (10 in the
pie rfc, not mentioned in the l4s rfc as yet)

3) Has this been tested on a hw mq system as yet? (10gigE is typically
64 queues)


[PATCH net-next 2/2] fq_codel: Kill useless per-flow dropped statistic

2019-08-03 Thread Dave Taht
It is almost impossible to get anything other than a 0 out of the
flow->dropped statistic with a tc class dump, as it resets to 0
on every round.

It also conflates ecn marks with drops.

It would have been useful had it kept a cumulative drop count, but
it doesn't. This patch doesn't change the API, it just stops
tracking a stat and state that is impossible to measure and nobody
uses.

Signed-off-by: Dave Taht 

---
 net/sched/sch_fq_codel.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index d67b2c40e6e6..9edd0f495001 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -45,7 +45,6 @@ struct fq_codel_flow {
struct sk_buff*tail;
struct list_head  flowchain;
int   deficit;
-   u32   dropped; /* number of drops (or ECN marks) on this flow */
struct codel_vars cvars;
 }; /* please try to keep this structure <= 64 bytes */
 
@@ -175,7 +174,6 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
 
/* Tell codel to increase its signal strength also */
flow->cvars.count += i;
-   flow->dropped += i;
q->backlogs[idx] -= len;
q->memory_usage -= mem;
sch->qstats.drops += i;
@@ -213,7 +211,6 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,
list_add_tail(&flow->flowchain, &q->new_flows);
q->new_flow_count++;
flow->deficit = q->quantum;
-   flow->dropped = 0;
}
get_codel_cb(skb)->mem_usage = skb->truesize;
q->memory_usage += get_codel_cb(skb)->mem_usage;
@@ -312,9 +309,6 @@ static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
&flow->cvars, &q->cstats, qdisc_pkt_len,
codel_get_enqueue_time, drop_func, dequeue_func);
 
-   flow->dropped += q->cstats.drop_count - prev_drop_count;
-   flow->dropped += q->cstats.ecn_mark - prev_ecn_mark;
-
if (!skb) {
/* force a pass through old_flows to prevent starvation */
if ((head == &q->new_flows) && !list_empty(&q->old_flows))
@@ -660,7 +654,7 @@ static int fq_codel_dump_class_stats(struct Qdisc *sch, unsigned long cl,
sch_tree_unlock(sch);
}
qs.backlog = q->backlogs[idx];
-   qs.drops = flow->dropped;
+   qs.drops = 0;
}
if (gnet_stats_copy_queue(d, NULL, &qs, qs.qlen) < 0)
return -1;
-- 
2.17.1



[PATCH net-next 1/2] Increase fq_codel count in the bulk dropper

2019-08-03 Thread Dave Taht
In the field fq_codel is often used with a smaller memory or
packet limit than the default, and when the bulk dropper is hit,
the drop pattern bifurcates into one that more slowly increases
the codel drop rate and hits the bulk dropper more than it should.

The scan through the 1024 queues happens more often than it needs to.

This patch increases the codel count in the bulk dropper, but
does not change the drop rate there, relying on the next codel round
to deliver the next packet at the original drop rate
(after that burst of loss), then escalate to a higher signaling rate.

Signed-off-by: Dave Taht 

---
 net/sched/sch_fq_codel.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index d59fbcc745d1..d67b2c40e6e6 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -173,6 +173,8 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
__qdisc_drop(skb, to_free);
} while (++i < max_packets && len < threshold);
 
+   /* Tell codel to increase its signal strength also */
+   flow->cvars.count += i;
flow->dropped += i;
q->backlogs[idx] -= len;
q->memory_usage -= mem;
-- 
2.17.1



[PATCH net-next 0/2] Two small fq_codel optimizations

2019-08-03 Thread Dave Taht
These two patches improve fq_codel performance 
under extreme network loads. The first patch 
more rapidly escalates the codel count under
overload, the second just kills a totally useless
statistic. 

(sent together because they'd otherwise conflict)

Signed-off-by: Dave Taht 

Dave Taht (2):
  Increase fq_codel count in the bulk dropper
  fq_codel: Kill useless per-flow dropped statistic

 net/sched/sch_fq_codel.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

-- 
2.17.1



Re: [RFC iproute2 0/4] Revert tc batchsize feature

2019-07-31 Thread Dave Taht
Does this fix my longstanding issue with piping commands into it? :P

( https://github.com/tohojo/flent/issues/146 )

On Wed, Jul 31, 2019 at 5:46 PM Stephen Hemminger
 wrote:
>
> The batchsize feature of tc might save a few cycles but it
> is a maintenance nightmare, it has uninitialized variables and
> poor error handling.
>
> This patch set reverts back to the original state.
> Please don't resubmit original code. Go back to the drawing
> board and do something generic.  For example, the routing
> daemons have figured out that by using multiple threads and
> turning off the netlink ACK they can update millions of routes
> quickly.
>
> Stephen Hemminger (4):
>   Revert "tc: Remove pointless assignments in batch()"
>   Revert "tc: flush after each command in batch mode"
>   Revert "tc: fix batch force option"
>   Revert "tc: Add batchsize feature for filter and actions"
>
>  tc/m_action.c  |  65 ++--
>  tc/tc.c| 201 -
>  tc/tc_common.h |   7 +-
>  tc/tc_filter.c | 129 ---
>  4 files changed, 87 insertions(+), 315 deletions(-)
>


-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


Re: Request for backport of 96125bf9985a75db00496dd2bc9249b777d2b19b

2019-07-17 Thread Dave Taht
On Mon, Jul 15, 2019 at 11:01 AM Loganaden Velvindron
 wrote:
>
> On Fri, Jul 5, 2019 at 6:15 PM Loganaden Velvindron  
> wrote:
> >
> > Hi folks,
> >
> > I read the guidelines for LTS/stable.
> > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> >
> >
> > Although this is not a bugfix, I am humbly submitting a request so
> > that commit id
> > -- 96125bf9985a75db00496dd2bc9249b777d2b19b Allow 0.0.0.0/8 as a valid
> > address range --  is backported to all LTS kernels.
> >
> > My motivation for such a request is that we need this patch to be as
> > widely deployed as possible and as early as possible for interop and
> > hopefully move into better utilization of ipv4 addresses space. Hence
> > my request for it be added to -stable.
> >
>
> Any feedback ?
>
> > Kind regards,
> > //Logan

I am perfectly willing to wait a year or so on the -stable front to
see what problems, if any, ensue from mainlining this in 5.3.
It's straightforward for distros that wish to do this backport (like
openwrt) to do it now, and other OSes will take longer than this to
adopt, regardless.


-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


[net-next 0/1] Allow 0.0.0.0/8 as a valid address range

2019-06-22 Thread Dave Taht
My talk's slides and video at netdev 0x13 about

"Potential IPv4 Unicast expansions" are up, here: 

https://netdevconf.org/0x13/session.html?talk-ipv4-unicast-expansions

There are roughly 419 million IPv4 addresses that are unallocated and
unused in the 0, localhost, reserved future multicast, and 240/4
spaces.

Linux already supports 240/4 fully. SDN's such as AWS already support
the entire IPv4 address space as a unicast playground.

This first patch for 0/8 was intended primarily as a conversation
starter - arguably we should rename the function across 22 fairly
"hot" files - but:

Should Linux treat these ranges as policy, and no longer enforce via
mechanism?

A full patchset for adding 225-232, and 127 address spaces is on github:
https://github.com/dtaht/unicast-extensions

with the very few needed patches for routing daemons and BSD also
available there.

Dave Taht (1):
  Allow 0.0.0.0/8 as a valid address range

 include/linux/in.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.17.1



[PATCH net-next 1/1] Allow 0.0.0.0/8 as a valid address range

2019-06-22 Thread Dave Taht
The longstanding prohibition against using 0.0.0.0/8 dates back
to two issues with the early internet.

There was an interoperability problem with BSD 4.2 in 1984, fixed in
BSD 4.3 in 1986. BSD 4.2 has long since been retired. 

Secondly, addresses of the form 0.x.y.z were initially defined only as
a source address in an ICMP datagram, indicating "node number x.y.z on
this IPv4 network", by nodes that know their address on their local
network, but do not yet know their network prefix, in RFC0792 (page
19).  This usage of 0.x.y.z was later repealed in RFC1122 (section
3.2.2.7), because the original ICMP-based mechanism for learning the
network prefix was unworkable on many networks such as Ethernet (which
have longer addresses that would not fit into the 24 "node number"
bits).  Modern networks use reverse ARP (RFC0903) or BOOTP (RFC0951)
or DHCP (RFC2131) to find their full 32-bit address and CIDR netmask
(and other parameters such as default gateways). 0.x.y.z has had
16,777,215 addresses in 0.0.0.0/8 space left unused and reserved for
future use, since 1989.

This patch allows for these 16m new IPv4 addresses to appear within
a box or on the wire. Layer 2 switches don't care.

0.0.0.0/32 is still prohibited, of course.

Signed-off-by: Dave Taht 
Signed-off-by: John Gilmore 
Acked-by: Toke Høiland-Jørgensen 

---
 include/linux/in.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 4d2fedfb753a..1873ef642605 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -63,7 +63,7 @@ static inline bool ipv4_is_all_snoopers(__be32 addr)
 
 static inline bool ipv4_is_zeronet(__be32 addr)
 {
-	return (addr & htonl(0xff000000)) == htonl(0x00000000);
+   return (addr == 0);
 }
 
 /* Special-Use IPv4 Addresses (RFC3330) */
-- 
2.17.1



[RFC PATCH net-next 1/1] Allow 0.0.0.0/8 as a valid address range

2019-06-13 Thread Dave Taht
The longstanding prohibition against using 0.0.0.0/8 dates back
to two issues with the early internet.

There was an interoperability problem with BSD 4.2 in 1984, fixed in
BSD 4.3 in 1986. BSD 4.2 has long since been retired. 

Secondly, addresses of the form 0.x.y.z were initially defined only as
a source address in an ICMP datagram, indicating "node number x.y.z on
this IPv4 network", by nodes that know their address on their local
network, but do not yet know their network prefix, in RFC0792 (page
19).  This usage of 0.x.y.z was later repealed in RFC1122 (section
3.2.2.7), because the original ICMP-based mechanism for learning the
network prefix was unworkable on many networks such as Ethernet (which
have longer addresses that would not fit into the 24 "node number"
bits).  Modern networks use reverse ARP (RFC0903) or BOOTP (RFC0951)
or DHCP (RFC2131) to find their full 32-bit address and CIDR netmask
(and other parameters such as default gateways). 0.x.y.z has had
16,777,215 addresses in 0.0.0.0/8 space left unused and reserved for
future use, since 1989.

This patch allows for these 16m new IPv4 addresses to appear within
a box or on the wire. Layer 2 switches don't care.

0.0.0.0/32 is still prohibited, of course.

Signed-off-by: Dave Taht 

---
 include/linux/in.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 4d2fedfb753a..1873ef642605 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -63,7 +63,7 @@ static inline bool ipv4_is_all_snoopers(__be32 addr)
 
 static inline bool ipv4_is_zeronet(__be32 addr)
 {
-	return (addr & htonl(0xff000000)) == htonl(0x00000000);
+   return (addr == 0);
 }
 
 /* Special-Use IPv4 Addresses (RFC3330) */
-- 
2.17.1



[RFC PATCH net-next 0/1] Allow 0.0.0.0/8 as a valid address range

2019-06-13 Thread Dave Taht
My talk's slides and video at netdev 0x13 about

"Potential IPv4 Unicast expansions" are up, here: 

https://netdevconf.org/0x13/session.html?talk-ipv4-unicast-expansions

There are roughly 419 million IPv4 addresses that are unallocated and
unused in the 0, localhost, reserved future multicast, and 240/4
spaces.

Linux already supports 240/4 fully. SDN's such as AWS already support
the entire IPv4 address space as a unicast playground.

This first patch for 0/8 is intended primarily as a conversation
starter - arguably we should rename the function across 22 fairly
"hot" files - but:

Should Linux treat these ranges as policy, and no longer enforce via
mechanism?

A full patchset for the remainder of the address spaces is on github:
https://github.com/dtaht/unicast-extensions

with the very few needed patches for routing daemons and BSD also
available there.

Dave Taht (1):
  Allow 0.0.0.0/8 as a valid address range

 include/linux/in.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.17.1



Re: [RFC net-next 2/2] net: sched: fq_pie: Flow Queue PIE AQM

2019-04-07 Thread Dave Taht
On Mon, Apr 8, 2019 at 7:37 AM Dave Taht  wrote:
>
> On Mon, Apr 8, 2019 at 7:31 AM Gautam Ramakrishnan  
> wrote:
> >
> > I was trying to refactor the code and I ran into some issues.
> > 1. I moved some of the parameters such as flows_cnt into a new struct
> > called fq_pie_params, instead of keeping them in fq_pie_sched_data.
> > Should I move those parameters back into fq_pie_sched_data?
> > 2. fq_codel maintains the backlog variable as a list in the
> > fq_codel_sched_data, whereas I maintain a backlog in the struct
> > fq_pie_flow. What approach should I follow?
>
> Hmm. I had made some efforts to speed up fq_codel here:
> https://github.com/dtaht/fq_codel_fast
>
> based on ideas here:
> https://lists.bufferbloat.net/pipermail/codel/2018-August/002367.html
>
> Notably I wanted to get rid of the non O(1) bulk dropper search and
> keep the flow backlog closer to the flow stats.
> Needed oprofiling to see if it helped.

I'm sorry to conflate your fq_pie upstreaming attempt with my own long
out of tree improvements for fq_codel. I'd been meaning to upstream
multiple fixes from there for ages, but needed to profile them on mips
first.

Are you in a hurry? Because hacking SCE into pie and fq_pie, etc,
is WAY more interesting than mainlining stuff right now. :)

Our updated "some congestion experienced"  internet draft with
suggestions for pie and red is here:

https://github.com/dtaht/bufferbloat-rfcs/blob/master/sce/draft-morton-taht-tsvwg-sce.txt#L272

and running code for sce is in that fq_codel_fast repo and in the out
of tree sch_cake also:

https://github.com/dtaht/sch_cake/commit/755ba4cda56bec45d20f7928f1442fdf83c12573

> > 3. Would maintaining a per flow pie_stats be useful in the future? I
> > do not use the per flow stats anywhere in the code.
>
> In general I have not found per flow stats useful or accessible and it
> was one of the things I eliminated in fq_codel_fast
> > On Tue, Apr 2, 2019 at 10:55 PM Toke Høiland-Jørgensen  
> > wrote:
> > >
> > > Gautam Ramakrishnan  writes:
> > >
> > > > Hello, thanks for the feedback
> > > >
> > > > On Tue, Apr 2, 2019 at 4:19 PM Toke Høiland-Jørgensen  
> > > > wrote:
> > > >>
> > > >> Some suggestions below to make fq_pie and fq_codel more similar (ref. 
> > > >> my
> > > >> previous email).
> > > >>
> > > >> Also, a few unrelated nits.
> > > >>
> > > >> > From: Mohit P. Tahiliani 
> > > >> >
> > > >> > FQ-PIE incorporates fair/flow queuing in which every queue
> > > >> > is managed by an instance of PIE queueing discipline.
> > > >> > The algorithm provides good control over the queueing delay
> > > >> > while at the same time, ensures fairness across various
> > > >> > flows sharing the same link.
> > > >> >
> > > >> > Principles:
> > > >> >   - Packets are classified stochastically on flows.
> > > >> >   - Each flow has a PIE managed queue.
> > > >> >   - Flows are linked onto two (Round Robin) lists,
> > > >> > so that new flows have priority on old ones.
> > > >> >   - For a given flow, packets are dropped on arrival according
> > > >> > to a drop probability.
> > > >> >   - Tail drops only.
> > > >>
> > > >> Why tail drops only?
> > > >
> > > > I had mentioned this because packets are dropped only at enqueue.
> > >
> > > Yup, realise that; was just wondering why you went with this design
> > > instead of doing the head drop that fq_codel does?
> > >
> > > >>
> > > >> > Usage:
> > > >> > tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
> > > >> > [ alpha NUMBER ] [ beta NUMBER ]
> > > >> >     [ target TIME us ] [ tupdate TIME us ]
> > > >> >   [ bytemode ] [ quantum BYTES ]
> > > >> >   [ ecn | ecn_prob PERCENTAGE ]
> > > >> >
> > > >> > defaults: 1024 flows, 10240 packets limit, quantum : device MTU
> > > >> >target: 15ms
> > > >> >tupdate: 15ms
> > > >> >alpha: 2 (on a scale of 0 to 16)
> > > >> >beta: 20 (on a scale of 0 to 16)
> > > >> >ecn: false
> > > >

Re: [RFC net-next 2/2] net: sched: fq_pie: Flow Queue PIE AQM

2019-04-07 Thread Dave Taht
On Mon, Apr 8, 2019 at 7:31 AM Gautam Ramakrishnan  wrote:
>
> I was trying to refactor the code and I ran into some issues.
> 1. I moved some of the parameters such as flows_cnt into a new struct
> called fq_pie_params, instead of keeping them in fq_pie_sched_data.
> Should I move those parameters back into fq_pie_sched_data?
> 2. fq_codel maintains the backlog variable as a list in the
> fq_codel_sched_data, whereas I maintain a backlog in the struct
> fq_pie_flow. What approach should I follow?

Hmm. I had made some efforts to speed up fq_codel here:
https://github.com/dtaht/fq_codel_fast

based on ideas here:
https://lists.bufferbloat.net/pipermail/codel/2018-August/002367.html

Notably I wanted to get rid of the non O(1) bulk dropper search and
keep the flow backlog closer to the flow stats.
Needed oprofiling to see if it helped.

> 3. Would maintaining a per flow pie_stats be useful in the future? I
> do not use the per flow stats anywhere in the code.

In general I have not found per flow stats useful or accessible and it
was one of the things I eliminated in fq_codel_fast
> On Tue, Apr 2, 2019 at 10:55 PM Toke Høiland-Jørgensen  
> wrote:
> >
> > Gautam Ramakrishnan  writes:
> >
> > > Hello, thanks for the feedback
> > >
> > > On Tue, Apr 2, 2019 at 4:19 PM Toke Høiland-Jørgensen  
> > > wrote:
> > >>
> > >> Some suggestions below to make fq_pie and fq_codel more similar (ref. my
> > >> previous email).
> > >>
> > >> Also, a few unrelated nits.
> > >>
> > >> > From: Mohit P. Tahiliani 
> > >> >
> > >> > FQ-PIE incorporates fair/flow queuing in which every queue
> > >> > is managed by an instance of PIE queueing discipline.
> > >> > The algorithm provides good control over the queueing delay
> > >> > while at the same time, ensures fairness across various
> > >> > flows sharing the same link.
> > >> >
> > >> > Principles:
> > >> >   - Packets are classified stochastically on flows.
> > >> >   - Each flow has a PIE managed queue.
> > >> >   - Flows are linked onto two (Round Robin) lists,
> > >> > so that new flows have priority on old ones.
> > >> >   - For a given flow, packets are dropped on arrival according
> > >> > to a drop probability.
> > >> >   - Tail drops only.
> > >>
> > >> Why tail drops only?
> > >
> > > I had mentioned this because packets are dropped only at enqueue.
> >
> > Yup, realise that; was just wondering why you went with this design
> > instead of doing the head drop that fq_codel does?
> >
> > >>
> > >> > Usage:
> > >> > tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
> > >> > [ alpha NUMBER ] [ beta NUMBER ]
> > >> > [ target TIME us ] [ tupdate TIME us ]
> > >> >   [ bytemode ] [ quantum BYTES ]
> > >> >   [ ecn | ecn_prob PERCENTAGE ]
> > >> >
> > >> > defaults: 1024 flows, 10240 packets limit, quantum : device MTU
> > >> >target: 15ms
> > >> >tupdate: 15ms
> > >> >alpha: 2 (on a scale of 0 to 16)
> > >> >beta: 20 (on a scale of 0 to 16)
> > >> >ecn: false
> > >> >ecn_prob: 10%
> > >> >
> > >> > Signed-off-by: Mohit P. Tahiliani 
> > >> > Signed-off-by: Sachin D. Patil 
> > >> > Signed-off-by: Mohit Bhasi 
> > >> > Signed-off-by: V. Saicharan 
> > >> > Signed-off-by: Leslie Monis 
> > >> > Signed-off-by: Gautam Ramakrishnan 
> > >> > Cc: Dave Taht 
> > >> > ---
> > >> >  include/uapi/linux/pkt_sched.h |  28 ++
> > >> >  net/sched/Kconfig  |  14 +-
> > >> >  net/sched/Makefile |   1 +
> > >> >  net/sched/sch_fq_pie.c | 485 +++++++++++++++++++++++++++++++++++++
> > >> >  4 files changed, 527 insertions(+), 1 deletion(-)
> > >> >  create mode 100644 net/sched/sch_fq_pie.c
> > >> >
> > >> > diff --git a/include/uapi/linux/pkt_sched.h 
> > >> > b/include/uapi/linux/pkt_sched.h
> > >> > index 7ee74c3474bf..005413bd09ee 100644
> > >> > --- a/include/uapi/linux/pkt_sched.h
> > >> > +++ b/include/uapi/l

Re: [PATCH] net: can: Increase tx queue length

2019-03-09 Thread Dave Taht
Toke Høiland-Jørgensen  writes:

> Appana Durga Kedareswara Rao  writes:
>
>> Hi Andre,
>>
>>  
>>> 
>>> On 3/9/19 3:07 PM, Appana Durga Kedareswara rao wrote:
>>> > While stress testing the CAN interface on xilinx axi can in loopback
>>> > mode getting message "write: no buffer space available"
>>> > Increasing device tx queue length resolved the above mentioned issue.
>>> 
>>> No need to patch the kernel:
>>> 
>>> $ ip link set  txqueuelen 500
>>> 
>>> does the same thing.
>>
>> Thanks for the review... 
>> Agree but it is not an out of box solution right?? 
>> Do you have any idea for socket can devices why the tx queue length is 10 
>> whereas
>> for other network devices (ex: ethernet) it is 1000 ??
>
> Probably because you don't generally want a long queue adding latency on
> a CAN interface? The default 1000 is already way too much even for an
> Ethernet device in a lot of cases.
>
> If you get "out of buffer" errors it means your application is sending
> things faster than the receiver (or device) can handle them. If you
> solve this by increasing the queue length you are just papering over the
> underlying issue, and trading latency for fewer errors. This tradeoff
> *may* be appropriate for your particular application, but I can imagine
> it would not be appropriate as a default. Keeping the buffer size small
> allows errors to propagate up to the application, which can then back
> off, or do something smarter, as appropriate.
>
> I don't know anything about the actual discussions going on when the
> defaults were set, but I can imagine something along the lines of the
> above was probably a part of it :)
>
> -Toke

In a related discussion, loud and often difficult, over here on the can bus, 

https://github.com/systemd/systemd/issues/9194#issuecomment-469403685

we found that applying fq_codel as the default qdisc via sysctl was a
bad idea for at least one model of CAN device.

If you scroll back on the bug, a good description of what the CAN
subsystem expects from the qdisc is therein - it mandates an in-order
fifo qdisc or no queue at all. The CAN protocol expects each packet to
be transmitted successfully or rejected; if rejected, the error is
passed up to userspace and the application is supposed to stop sending
further input.

As this was the first serious bug ever reported against using fq_codel
as the default in 5+ years of systemd and 7 years of openwrt deployment,
I've been taking it very seriously. It's worse than just systemd - openwrt
patches out pfifo_fast entirely. pfifo_fast is the wrong qdisc - the
right choices are noqueue and possibly pfifo.

However, the vcan device exposes noqueue, and so far the only
misbehaving device has been the one (an 8Devices socketcan USB2CAN)
whose driver did not do this.

Which was just corrected with a simple:

static int usb_8dev_probe(struct usb_interface *intf,
			  const struct usb_device_id *id)
{
	...
	netdev->netdev_ops = &usb_8dev_netdev_ops;

	netdev->flags |= IFF_ECHO;	/* we support local echo */
+	netdev->priv_flags |= IFF_NO_QUEUE;
	...
}

and successfully tested on that bug report.

So at the moment, my thought is that all CAN devices should default to
noqueue, if they are not already. I think pfifo_fast with a qlen of any
size is the wrong thing, but I still don't know enough about what other
CAN devices do or did to be certain.



Re: [PATCH net-next v3 0/7] net: sched: pie: align PIE implementation with RFC 8033

2019-02-25 Thread Dave Taht
On Mon, Feb 25, 2019 at 4:38 PM Stephen Hemminger
 wrote:
>
> On Tue, 26 Feb 2019 00:39:54 +0530
> Leslie Monis  wrote:
>
> > The current implementation of the PIE queuing discipline is according to the
> > IETF draft [http://tools.ietf.org/html/draft-pan-aqm-pie-00] and the paper
> > [PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem].
> > However, a lot of necessary modifications and enhancements have been 
> > proposed
> > in RFC 8033, which have not yet been incorporated in the source code of 
> > Linux.
> > This patch series helps in achieving the same.
> >
> > Performance tests carried out using Flent [https://flent.org/]
> >
> > Changes from v2 to v3:
> >   - Used div_u64() instead of direct division after explicit type casting as
> > recommended by David
> >
> > Changes from v1 to v2:
> >   - Excluded the patch setting PIE dynamically active/inactive as the test
> > results were unsatisfactory
> >   - Fixed a scaling issue when adding more auto-tuning cases which caused
> > local variables to underflow
> >   - Changed the long if/else chain to a loop as suggested by Stephen
> >   - Changed the position of the accu_prob variable in the pie_vars
> > structure as recommended by Stephen
> >
> > Mohit P. Tahiliani (7):
> >   net: sched: pie: change value of QUEUE_THRESHOLD
> >   net: sched: pie: change default value of pie_params->target
> >   net: sched: pie: change default value of pie_params->tupdate
> >   net: sched: pie: change initial value of pie_vars->burst_time
> >   net: sched: pie: add more cases to auto-tune alpha and beta
> >   net: sched: pie: add derandomization mechanism
> >   net: sched: pie: update references
> >
> >  include/uapi/linux/pkt_sched.h |   2 +-
> >  net/sched/sch_pie.c| 107 -
> >  2 files changed, 66 insertions(+), 43 deletions(-)
>
> Are you concerned at all that changes to default values might change
> expected behavior of existing users?

There's existing users?

Most of these changes are really subtle and came out as we went along.
I'd have to *really search* to find my notes from that period,
and I have not evaluated the sum of these new changes, but overall
they were a slight improvement from the defaults as we shipped it way
back when.

I'm not being snarky - are there existing users? Every last person I
know that ever evaluated pie switched to fq_codel. Cisco ended up
going with another bufferbloat-fighting solution in one new device and
that team dissolved.

docsis-pie, well, that's a standard, and it's a bit different from
this pie, but these changes bring this pie closer to that pie. And it
works ok.

I certainly hope Leslie's group has done testing similar to what we
did then? If there are results I can dig up my old results and see what
compares. I imagine they plan a paper. Our old flent test code and
data is all published, but I am not willing to spend another minute of
my life on a single-queued aqm design unless someone writes me a very
big check for it, which I'd then spend on fixing P4 to run fq-codel.

fq-pie (which is waiting in the wings behind this set of patches) is
ok, as at least the BSD version was competitive with fq_codel in most
respects. I told 'em they needed to look hard at the rate estimator.

The ECN handling problem mentioned is there in all known versions of
pie: it reverts to drop too soon for ecn to be as much benefit as it
could be.

and then l4s, which is the subject of 3 upcoming talks at netdevconf,
well... not gonna talk about it. I'm willing to give them their day in
court.

Sorry to come across as grumpy. Should Linux be RFC-compliant even if
the RFC has issues? Damned if I know.



-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


Re: [PATCH net-next v2 0/7] net: sched: pie: align PIE implementation with RFC 8033

2019-02-25 Thread Dave Taht
Jamal Hadi Salim  writes:

> On 2019-02-25 8:43 a.m., Jamal Hadi Salim wrote:
>> On 2019-02-25 5:20 a.m., Leslie Monis wrote:
>>> The current implementation of the PIE queuing discipline is
>>> according to the
>>> IETF draft [http://tools.ietf.org/html/draft-pan-aqm-pie-00] and
>>> the paper
>>> [PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem].
>>> However, a lot of necessary modifications and enhancements have
>>> been proposed
>>> in RFC 8033, which have not yet been incorporated in the source
>>> code of Linux.
>>> This patch series helps in achieving the same.
>>>
>>> Performance tests carried out using Flent [https://flent.org/]
>>>
>>
>> +Cc the authors of PIE to double check these values.
>> Please respond (it is how open source works!) and N/ACK
>
> Great. Bouncing addresses.
> D. Taht - can you look at this?

Whose addresses are bouncing? The folk that did pie originally were
mostly contractors and long ago moved on to other things.

My taht.net address tends to bounce to vger as I long ago mandated
starttls on all email transactions, which is why I use my gmail account
for postings there. I keep hoping that one day vger will support
starttls... posting this here as I'm an optimist. I wasn't expecting to
be cc'd on this submittal.

Pie seems to be mostly an abandoned CISCO effort outside of docsis-pie,
at least at the moment. I've heard not a peep from them in years.
(fq_codel seems to have mostly "won" in our world)

I did review all these changes when they went by in v1, and aside from
the ecn mistake ending up in the final RFC when I wasn't looking[1], I
approve of these changes that finally make sch_pie compliant with the
rfc.

[1] We proposed refining pie's ecn handling here:

https://github.com/gautamramk/FQ-PIE-for-Linux-Kernel/issues/2

Anyway:

Acked-by: Dave Taht 



Re: [PATCH] Allow class-e address assignment via ifconfig ioctl

2018-12-15 Thread Dave Taht
David Miller  writes:

> From: Dave Taht 
> Date: Tue, 11 Dec 2018 15:30:34 -0800
>
>> While most distributions long ago switched to the iproute2 suite
>> of utilities, which allow class-e (240.0.0.0/4) address assignment,
>> distributions relying on busybox, toybox and other forms of
>> ifconfig cannot assign class-e addresses without this kernel patch.
>> 
>> While classful addressing has been obsolete for 2 decades, and a survey
>> of all the open source code in the world shows the IN_whatever macros
>> are also obsolete... rather than removing the classful logic from this
>> ioctl entirely, this patch merely enables class-e assignment, sanely.
>> 
>> Signed-off-by: Dave Taht 
>
> Applied, thanks Dave.

Thanks, but...

Into what tree did you pull it? It's not in net-next as I speak.

I reworked it a bit, giving a hat tip to Vince Fuller and his original
patch here: https://lkml.org/lkml/2008/1/7/370

http://www.taht.net/classe/0001-linux-kernel-Allow-class-e-address-assignment-via-ifconfig-ioctl.patch

and was minutes away from submitting that version when you took this.


[PATCH] Allow class-e address assignment via ifconfig ioctl

2018-12-11 Thread Dave Taht
While most distributions long ago switched to the iproute2 suite
of utilities, which allow class-e (240.0.0.0/4) address assignment,
distributions relying on busybox, toybox and other forms of
ifconfig cannot assign class-e addresses without this kernel patch.

While classful addressing has been obsolete for 2 decades, and a survey
of all the open source code in the world shows the IN_whatever macros
are also obsolete... rather than removing the classful logic from this
ioctl entirely, this patch merely enables class-e assignment, sanely.

Signed-off-by: Dave Taht 
---
 include/uapi/linux/in.h | 10 +++---
 net/ipv4/devinet.c  |  5 +++--
 net/ipv4/ipconfig.c |  2 ++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index 48e8a225b985..eba44371231f 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -266,10 +266,14 @@ struct sockaddr_in {
 
 #define	IN_CLASSD(a)		((((long int) (a)) & 0xf0000000) == 0xe0000000)
 #define	IN_MULTICAST(a)		IN_CLASSD(a)
-#define	IN_MULTICAST_NET	0xF0000000
+#define	IN_MULTICAST_NET	0xe0000000
 
-#define	IN_EXPERIMENTAL(a)	((((long int) (a)) & 0xf0000000) == 0xf0000000)
-#define	IN_BADCLASS(a)		IN_EXPERIMENTAL((a))
+#define	IN_BADCLASS(a)		(((long int) (a)) == 0xffffffff)
+#define	IN_EXPERIMENTAL(a)	IN_BADCLASS((a))
+
+#define	IN_CLASSE(a)		((((long int) (a)) & 0xf0000000) == 0xf0000000)
+#define	IN_CLASSE_NET		0xffffffff
+#define	IN_CLASSE_NSHIFT	0
 
 /* Address to accept any incoming messages. */
 #define	INADDR_ANY		((unsigned long int) 0x00000000)
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index a34602ae27de..608a6f4223fb 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -952,17 +952,18 @@ static int inet_abc_len(__be32 addr)
 {
 	int rc = -1;	/* Something else, probably a multicast. */
 
-	if (ipv4_is_zeronet(addr))
+	if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
 		rc = 0;
 	else {
 		__u32 haddr = ntohl(addr);
-
 		if (IN_CLASSA(haddr))
 			rc = 8;
 		else if (IN_CLASSB(haddr))
 			rc = 16;
 		else if (IN_CLASSC(haddr))
 			rc = 24;
+		else if (IN_CLASSE(haddr))
+			rc = 32;
 	}
 
 	return rc;
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 88212615bf4c..2393e5c106bf 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -429,6 +429,8 @@ static int __init ic_defaults(void)
 			ic_netmask = htonl(IN_CLASSB_NET);
 		else if (IN_CLASSC(ntohl(ic_myaddr)))
 			ic_netmask = htonl(IN_CLASSC_NET);
+		else if (IN_CLASSE(ntohl(ic_myaddr)))
+			ic_netmask = htonl(IN_CLASSE_NET);
 		else {
 			pr_err("IP-Config: Unable to guess netmask for address %pI4\n",
 			       &ic_myaddr);
-- 
2.17.1
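
To illustrate what the reworked macros compute, a small userspace
harness (hypothetical, not part of the patch) that mirrors the classful
logic of the patched inet_abc_len(), omitting the zeronet/lbcast
special cases:

#include <arpa/inet.h>
#include <netinet/in.h>		/* IN_CLASSA/B/C from the system headers */
#include <stdint.h>
#include <stdio.h>

#ifndef IN_CLASSE
#define IN_CLASSE(a)	((((long int) (a)) & 0xf0000000) == 0xf0000000)
#endif

/* Mirrors the patched inet_abc_len(): guess a default prefix length. */
static int abc_len(const char *ip)
{
	uint32_t addr, haddr;

	if (inet_pton(AF_INET, ip, &addr) != 1)
		return -1;
	haddr = ntohl(addr);
	if (IN_CLASSA(haddr))
		return 8;
	if (IN_CLASSB(haddr))
		return 16;
	if (IN_CLASSC(haddr))
		return 24;
	if (IN_CLASSE(haddr))
		return 32;	/* class-e now gets a /32 default */
	return -1;		/* multicast etc. */
}

int main(void)
{
	printf("10.0.0.1  -> /%d\n", abc_len("10.0.0.1"));	/* /8 */
	printf("240.1.2.3 -> /%d\n", abc_len("240.1.2.3"));	/* /32 */
	return 0;
}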



Re: [PATCH v2 iproute2-next 1/3] tc: support conversions to or from 64 bit nanosecond-based time

2018-10-03 Thread Dave Taht
On Mon, Aug 27, 2018 at 9:39 AM Dave Taht  wrote:
>
> On Mon, Aug 27, 2018 at 9:11 AM Stephen Hemminger
>  wrote:
> >
> > On Sun, 26 Aug 2018 19:42:28 -0700
> > Yousuk Seung  wrote:
> >
> > > +int get_time(unsigned int *time, const char *str)
> > > +{
> > > + double t;
> > > + char *p;
> > > +
> > > + t = strtod(str, &p);
> > > + if (p == str)
> > > + return -1;
> > > +
> > > + if (*p) {
> > > + if (strcasecmp(p, "s") == 0 || strcasecmp(p, "sec") == 0 ||
> > > + strcasecmp(p, "secs") == 0)
> > > + t *= TIME_UNITS_PER_SEC;
> > > + else if (strcasecmp(p, "ms") == 0 || strcasecmp(p, "msec") 
> > > == 0 ||
> > > +  strcasecmp(p, "msecs") == 0)
> > > + t *= TIME_UNITS_PER_SEC/1000;
> > > + else if (strcasecmp(p, "us") == 0 || strcasecmp(p, "usec") 
> > > == 0 ||
> > > +  strcasecmp(p, "usecs") == 0)
> > > + t *= TIME_UNITS_PER_SEC/1000000;
> > > + else
> > > + return -1;
> >
> > Do we need to really support UPPER case.
>
> But that's  ALWAYS been the case in the 32 bit version of code above.
> Imagine how many former VMS and MVS hackers you'd upset if they had to
> turn caps-lock off!

I was trying to be funny, of course. If you want us to rework the
patch to accept only lowercase for both, ok... but I'd rather finish
getting this upstream; there's a change to netem enabling nsec time
long stuck behind it.

>
> > Isn't existing matches semantics good enough?
>
> But that's the existing case for the 32 bit api, now replicated in the
> 64 bit api. ? I think the case-insensitive ship has sailed here. Can't
> break userspace.
>
> Well.. adding UTF-8 would be cool. We could start using the actual
> greek  symbols for delta (δ) and beta (β) in particular. It would
> replace a lot of typing, with a whole bunch more shift keys on a
> single letter, fit better into
> 80 column lines, and so on, and tc inputs and outputs are already
> pretty greek to many.
> --
>
> Dave Täht
> CEO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-669-226-2619



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


Re: [PATCH v2 iproute2-next 1/3] tc: support conversions to or from 64 bit nanosecond-based time

2018-08-27 Thread Dave Taht
On Mon, Aug 27, 2018 at 9:11 AM Stephen Hemminger
 wrote:
>
> On Sun, 26 Aug 2018 19:42:28 -0700
> Yousuk Seung  wrote:
>
> > +int get_time(unsigned int *time, const char *str)
> > +{
> > + double t;
> > + char *p;
> > +
> > + t = strtod(str, &p);
> > + if (p == str)
> > + return -1;
> > +
> > + if (*p) {
> > + if (strcasecmp(p, "s") == 0 || strcasecmp(p, "sec") == 0 ||
> > + strcasecmp(p, "secs") == 0)
> > + t *= TIME_UNITS_PER_SEC;
> > + else if (strcasecmp(p, "ms") == 0 || strcasecmp(p, "msec") == 
> > 0 ||
> > +  strcasecmp(p, "msecs") == 0)
> > + t *= TIME_UNITS_PER_SEC/1000;
> > + else if (strcasecmp(p, "us") == 0 || strcasecmp(p, "usec") == 
> > 0 ||
> > +  strcasecmp(p, "usecs") == 0)
> > + t *= TIME_UNITS_PER_SEC/1000000;
> > + else
> > + return -1;
>
> Do we need to really support UPPER case.

But that's  ALWAYS been the case in the 32 bit version of code above.
Imagine how many former VMS and MVS hackers you'd upset if they had to
turn caps-lock off!

> Isn't existing matches semantics good enough?

But that's the existing case for the 32 bit api, now replicated in the
64 bit api. ? I think the case-insensitive ship has sailed here. Can't
break userspace.

Well.. adding UTF-8 would be cool. We could start using the actual
greek  symbols for delta (δ) and beta (β) in particular. It would
replace a lot of typing, with a whole bunch more shift keys on a
single letter, fit better into
80 column lines, and so on, and tc inputs and outputs are already
pretty greek to many.
--

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


Re: [PATCH iproute2-next 1/3] tc: support conversions to or from 64 bit nanosecond-based time

2018-08-13 Thread Dave Taht
On Sun, Aug 12, 2018 at 3:09 PM David Ahern  wrote:
>
> On 8/6/18 11:09 AM, Yousuk Seung wrote:
> > diff --git a/tc/tc_core.h b/tc/tc_core.h
> > index 1dfa9a4f773b..a0fe0923d171 100644
> > --- a/tc/tc_core.h
> > +++ b/tc/tc_core.h
> > @@ -7,6 +7,10 @@
> >
> >  #define TIME_UNITS_PER_SEC   1000000
> >
> > +#define NSEC_PER_USEC 1000
> > +#define NSEC_PER_MSEC 1000000
> > +#define NSEC_PER_SEC 1000000000LL
> > +
>
> These are not specific to tc so a header in include is a better location
> (utils.h or a new one)
>
> >  enum link_layer {
> >   LINKLAYER_UNSPEC,
> >   LINKLAYER_ETHERNET,
> > diff --git a/tc/tc_util.c b/tc/tc_util.c
> > index d7578528a31b..c39c9046dcae 100644
> > --- a/tc/tc_util.c
> > +++ b/tc/tc_util.c
>
> Similarly for these time functions - not specific to tc so move to
> lib/utils.c
>
> > @@ -385,6 +385,61 @@ char *sprint_ticks(__u32 ticks, char *buf)
> >   return sprint_time(tc_core_tick2time(ticks), buf);
> >  }
> >
> > +/* 64 bit times are represented internally in nanoseconds */
> > +int get_time64(__s64 *time, const char *str)
>
> __u64 seems more appropriate than __s64

The reason why these are signed is to leave room in the API to
print/manage negative values. Wasting the 64th bit thusly would only
matter after extreme uptimes.

There was something of a long debate on this when these patches went
around the first time. We ended up with signed time in the netem code
also.


> > +{
> > + double nsec;
> > + char *p;
> > +
> > + nsec = strtod(str, &p);
> > + if (p == str)
> > + return -1;
> > +
> > + if (*p) {
> > + if (strcasecmp(p, "s") == 0 ||
> > + strcasecmp(p, "sec") == 0 ||
> > + strcasecmp(p, "secs") == 0)
> > + nsec *= NSEC_PER_SEC;
> > + else if (strcasecmp(p, "ms") == 0 ||
> > +  strcasecmp(p, "msec") == 0 ||
> > +  strcasecmp(p, "msecs") == 0)
> > + nsec *= NSEC_PER_MSEC;
> > + else if (strcasecmp(p, "us") == 0 ||
> > +  strcasecmp(p, "usec") == 0 ||
> > +  strcasecmp(p, "usecs") == 0)
> > + nsec *= NSEC_PER_USEC;
> > + else if (strcasecmp(p, "ns") == 0 ||
> > +  strcasecmp(p, "nsec") == 0 ||
> > +  strcasecmp(p, "nsecs") == 0)
>
> strncasecmp would be more efficient
>
> > + nsec *= 1;
> > + else
> > + return -1;
> > + }
> > +
> > + *time = nsec;
> > + return 0;
> > +}
> > +
> > +void print_time64(char *buf, int len, __s64 time)
> > +{
> > + double nsec = time;
> > +
> > + if (time >= NSEC_PER_SEC)
> > + snprintf(buf, len, "%.3fs", nsec/NSEC_PER_SEC);
> > + else if (time >= NSEC_PER_MSEC)
> > + snprintf(buf, len, "%.3fms", nsec/NSEC_PER_MSEC);
> > + else if (time >= NSEC_PER_USEC)
> > + snprintf(buf, len, "%.3fus", nsec/NSEC_PER_USEC);
> > + else
> > + snprintf(buf, len, "%lldns", time);
> > +}
> > +
> > +char *sprint_time64(__s64 time, char *buf)
> > +{
> > + print_time64(buf, SPRINT_BSIZE-1, time);
> > + return buf;
> > +}
> > +
> >  int get_size(unsigned int *size, const char *str)
> >  {
> >   double sz;



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
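
The conversion semantics under review reduce to the following
standalone sketch (hypothetical, condensed from the quoted patch;
suffix handling trimmed to one spelling per unit):

#include <stdio.h>
#include <stdlib.h>
#include <strings.h>

#define NSEC_PER_USEC 1000LL
#define NSEC_PER_MSEC 1000000LL
#define NSEC_PER_SEC  1000000000LL

/* Parse "3s", "1.5ms", "200us" or "750ns" into signed nanoseconds. */
static int parse_time64(long long *time, const char *str)
{
	char *p;
	double nsec = strtod(str, &p);

	if (p == str)
		return -1;
	if (*p) {
		if (!strcasecmp(p, "s"))
			nsec *= NSEC_PER_SEC;
		else if (!strcasecmp(p, "ms"))
			nsec *= NSEC_PER_MSEC;
		else if (!strcasecmp(p, "us"))
			nsec *= NSEC_PER_USEC;
		else if (strcasecmp(p, "ns"))
			return -1;	/* "ns" needs no scaling */
	}
	*time = nsec;
	return 0;
}

int main(void)
{
	long long t;

	if (!parse_time64(&t, "1.5ms"))
		printf("1.5ms = %lld ns\n", t);	/* prints 1500000 */
	return 0;
}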


Re: [PATCH v1 6/7] net: mvneta: Don't use GRO on Armada 3720

2018-08-08 Thread Dave Taht
On Wed, Aug 8, 2018 at 10:00 AM Andrew Lunn  wrote:
>
> On Wed, Aug 08, 2018 at 05:27:05PM +0200, Marek Behún wrote:
> > For some reason on Armada 3720 boards (EspressoBin and Turris Mox) the
> > networking driver behaves weirdly when using napi_gro_receive.
> >
> > For example downloading a big file from a local network (low ping) is
> > fast, but when downloading from a remote server (higher ping), the
> > download speed is at first high but drops rapidly to almost nothing or
> > absolutely nothing.
> >
> > This is fixed when using netif_receive_skb instead of napi_gro_receive.
>
> Before doing this, we should really understand what is going on. It is
> probably just a driver bug which needs fixing. And GRO should be good
> for performance, so we do want to use it, if possible.

I'd just disable it and worry about it later. The software GRO in the
mvneta driver would batch up 64k, and is one of the reasons why sch_cake
does gso splitting by default (64k unsplit, downshifted to 1Mbit, is
~540ms of latency). If this mvneta facility is, in addition, buggy, that
explains some puzzling things I've seen in various benchmarks. Thanks
for the steer as to what to look for!

IMHO: in general gro looks good on dumb single stream benchmarks, not
as useful on mixed routed traffic with more entropy, and batching
clutters up the mvneta receive path that otherwise could be draining
the rx ring and spitting packets into the rest of the system faster.
The mvneta is mostly (?) used on routing devices.



>
> Andrew



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


Re: [PATCH iproute2] ip link: don't stop batch processing

2018-08-03 Thread Dave Taht
On Fri, Aug 3, 2018 at 10:50 AM Matteo Croce  wrote:
>
> When 'ip link show dev DEVICE' is processed in a batch mode, ip exits
> and stop processing further commands.
> This because ipaddr_list_flush_or_save() calls exit() to avoid printing
> the link information twice.
> Replace the exit with a classic goto out instruction.
>
> Signed-off-by: Matteo Croce 

one thing I noticed in iproute2-next last week is that

( echo qdisc show dev eno1; sleep 5; echo qdisc show dev eno1; ) | tc -b -

batches the whole thing up to emerge on exit, only.

It didn't use to do that; the output of every command came out as it
completed. I used to use that to timestamp and save the overhead of
invoking the tc utility on openwrt while monitoring qdisc stats in
https://github.com/tohojo/flent/blob/master/misc/tc_iterate.c

alternatively adding timed/timestamped output to tc like -c count or
-I interval would be useful.

> ---
>  ip/ipaddress.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> index 6c306ab7..b7b78f6e 100644
> --- a/ip/ipaddress.c
> +++ b/ip/ipaddress.c
> @@ -1920,7 +1920,7 @@ static int ipaddr_list_flush_or_save(int argc, char 
> **argv, int action)
> exit(1);
> }
> delete_json_obj();
> -   exit(0);
> +   goto out;
> }
>
> if (filter.family != AF_PACKET) {
> --
> 2.17.1
>


-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


[PATCH iproute2-next] sch_cake: Make gso-splitting configurable

2018-07-27 Thread Dave Taht
This patch makes sch_cake's gso/gro splitting configurable
from userspace.

To disable breaking apart superpackets in sch_cake:

tc qdisc replace dev whatever root cake no-split-gso

to enable:

tc qdisc replace dev whatever root cake split-gso

Signed-off-by: Toke Høiland-Jørgensen 
Signed-off-by: Dave Taht 

---
 tc/q_cake.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/tc/q_cake.c b/tc/q_cake.c
index f1e232a..727d673 100644
--- a/tc/q_cake.c
+++ b/tc/q_cake.c
@@ -79,6 +79,7 @@ static void explain(void)
 "  dual-srchost | dual-dsthost | triple-isolate* ]\n"
 "[ nat | nonat* ]\n"
 "[ wash | nowash* ]\n"
+"[ split-gso* | no-split-gso ]\n"
 "[ ack-filter | ack-filter-aggressive | no-ack-filter* ]\n"
 "[ memlimit LIMIT ]\n"
 "[ ptm | atm | noatm* ] [ overhead N | conservative | raw* ]\n"
@@ -108,6 +109,7 @@ static int cake_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
int nat = -1;
int atm = -1;
int mpu = 0;
+   int split_gso = -1;
 
while (argc > 0) {
if (strcmp(*argv, "bandwidth") == 0) {
@@ -155,6 +157,10 @@ static int cake_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
wash = 0;
} else if (strcmp(*argv, "wash") == 0) {
wash = 1;
+   } else if (strcmp(*argv, "split-gso") == 0) {
+   split_gso = 1;
+   } else if (strcmp(*argv, "no-split-gso") == 0) {
+   split_gso = 0;
} else if (strcmp(*argv, "flowblind") == 0) {
flowmode = CAKE_FLOW_NONE;
} else if (strcmp(*argv, "srchost") == 0) {
@@ -374,6 +380,9 @@ static int cake_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
addattr_l(n, 1024, TCA_CAKE_NAT, &nat, sizeof(nat));
if (wash != -1)
addattr_l(n, 1024, TCA_CAKE_WASH, &wash, sizeof(wash));
+   if (split_gso != -1)
+   addattr_l(n, 1024, TCA_CAKE_SPLIT_GSO, &split_gso,
+ sizeof(split_gso));
if (ingress != -1)
addattr_l(n, 1024, TCA_CAKE_INGRESS, &ingress, sizeof(ingress));
if (ack_filter != -1)
-- 
2.7.4



Re: [PATCH v2 net-next 01/14] net: Clear skb->tstamp only on the forwarding path

2018-07-18 Thread Dave Taht
In my dreamworld, a packet with a timestamp acquired at rx time on
ethX, or via local traffic, would be consistent with the right clock
throughout the system and reliably still be there when it goes to
ethY.

This would save having to timestamp (again) inside the cb block in
fq_codel, cake, etc, and more importantly, measure total system delay
from entrance to egress, (e.g tc filters, firewall rules, routing
table lookups), not just queuing delay at the qdisc, thus detecting
when a system was overloaded and reducing queue size and throughput
appropriately as a result to cope.
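
For reference, the duplicate timestamping lamented above looks roughly
like this in the codel family of qdiscs; a from-memory sketch along the
lines of net/codel.h (kernel context, not a literal copy):

/* Sketch: codel-family qdiscs stamp arrival time into the private skb
 * control block at enqueue; a trustworthy skb->tstamp carried from rx
 * would make this second stamp unnecessary, and would measure
 * whole-system delay rather than qdisc queueing delay alone. */
#define CODEL_SHIFT 10		/* time resolution: ns >> 10 ~= 1.024 us */
typedef u32 codel_time_t;

struct codel_skb_cb {
	codel_time_t enqueue_time;
};

static struct codel_skb_cb *get_codel_cb(const struct sk_buff *skb)
{
	qdisc_cb_private_validate(skb, sizeof(struct codel_skb_cb));
	return (struct codel_skb_cb *)qdisc_skb_cb(skb)->data;
}

static void codel_set_enqueue_time(struct sk_buff *skb)
{
	get_codel_cb(skb)->enqueue_time = ktime_get_ns() >> CODEL_SHIFT;
}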


Re: [PATCH net-next 4/4] act_mirred: use ACT_REDIRECT when possible

2018-07-17 Thread Dave Taht
On Tue, Jul 17, 2018 at 2:16 AM Paolo Abeni  wrote:
>
> Hi,
>
> On Mon, 2018-07-16 at 16:39 -0700, Cong Wang wrote:
> > On Fri, Jul 13, 2018 at 2:55 AM Paolo Abeni  wrote:
> > >
> > > When mirred is invoked from the ingress path, and it wants to redirect
> > > the processed packet, it can now use the ACT_REDIRECT action,
> > > filling the tcf_result accordingly.
> > >
> > > This avoids a skb_clone() in the TC S/W data path giving a ~10%
> > > improvement in forwarding performances. Overall TC S/W performances
> > > are now comparable to the kernel openswitch datapath.
>
> Thank you for the feedback.
>
> > Avoiding skb_clone() for redirection is cool, but why need to use
> > skb_do_redirect() here?
>
> Well, it's not needed. I tried to reduce code duplication, and I tried
> to avoid adding another TC_ACT_* value.
>
> > There is a subtle difference here:
> >
> > skb_do_redirect() calls __bpf_rx_skb() which calls
> > dev_forward_skb().
> >
> > while the current mirred action doesn't scrub packets when
> > redirecting to ingress (from egress). Although I forget if it is
> > intentionally.
>
> Understood.
>
> A possible option out of this issues would be adding another action
> value - TC_ACT_MIRRED ? - and handle it in sch_handle_egress()[1] with
> the appropriate semantic. That should address also Daniel and Eyal
> concerns.
>
> Would you consider the above acceptable?

Our dream has been to be able to specify a 1 liner:

tc qdisc add dev eth0 ingress cake bandwidth 200mbit besteffort wash

and be done with it, eliminating ifb, mirred, etc, entirely.

(I am otherwise delighted to see anything appear that makes inbound
shaping cheaper along the lines of this patchset)
>
> Thanks,
>
> Paolo
>
>


-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


Re: [PATCH] ucc_geth: Add BQL support

2018-06-19 Thread Dave Taht
Very happy to see this. Is there a specific chip or devboard this runs on?

On Tue, Jun 19, 2018 at 11:24 AM, Li Yang  wrote:
> On Tue, Jun 19, 2018 at 11:30 AM, Joakim Tjernlund
>  wrote:
>> Signed-off-by: Joakim Tjernlund 
>
> Acked-by: Li Yang 
>
>> ---
>>  drivers/net/ethernet/freescale/ucc_geth.c | 7 ++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/freescale/ucc_geth.c 
>> b/drivers/net/ethernet/freescale/ucc_geth.c
>> index f77ba9fa257b..6c99a9af6647 100644
>> --- a/drivers/net/ethernet/freescale/ucc_geth.c
>> +++ b/drivers/net/ethernet/freescale/ucc_geth.c
>> @@ -3096,6 +3096,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, 
>> struct net_device *dev)
>>
>> ugeth_vdbg("%s: IN", __func__);
>>
>> +   netdev_sent_queue(dev, skb->len);
>> spin_lock_irqsave(&ugeth->lock, flags);
>>
>> dev->stats.tx_bytes += skb->len;
>> @@ -3242,6 +3243,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>> struct ucc_geth_private *ugeth = netdev_priv(dev);
>> u8 __iomem *bd; /* BD pointer */
>> u32 bd_status;
>> +   int howmany = 0;
>> +   unsigned int bytes_sent = 0;
>>
>> bd = ugeth->confBd[txQ];
>> bd_status = in_be32((u32 __iomem *)bd);
>> @@ -3257,7 +3260,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>> skb = ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]];
>> if (!skb)
>> break;
>> -
>> +   howmany++;
>> +   bytes_sent += skb->len;
>> dev->stats.tx_packets++;
>>
>> dev_consume_skb_any(skb);
>> @@ -3279,6 +3283,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>> bd_status = in_be32((u32 __iomem *)bd);
>> }
>> ugeth->confBd[txQ] = bd;
>> +   netdev_completed_queue(dev, howmany, bytes_sent);
>> return 0;
>>  }
>>
>> --
>> 2.13.6
>>



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
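
One related pattern worth noting with any BQL conversion like the one
above: if tx descriptors are ever freed outside the normal completion
path, the BQL state must be cleared too. A sketch, assuming the driver
has a ring-flush path (the function name here is hypothetical):

/* Sketch: if pending skbs are freed without passing through the normal
 * completion path (e.g. on down/reset), BQL's in-flight byte count must
 * be zeroed or the queue can stall once the device comes back up. */
static void ucc_geth_flush_tx_ring(struct net_device *dev)
{
	/* ... free any skbs still sitting in the tx ring ... */
	netdev_reset_queue(dev);	/* forget in-flight byte/packet counts */
}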


Re: [PATCH iproute2-next] Add tc-BPF example for a TCP pure ack recognizer

2018-05-01 Thread Dave Taht
On Tue, May 1, 2018 at 9:12 PM, David Ahern  wrote:
> On 5/1/18 12:32 PM, Dave Taht wrote:
>> ack_recognize can shift pure tcp acks into another flowid.
>> ---
>>  examples/bpf/ack_recognize.c | 98 ++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 98 insertions(+)
>>  create mode 100644 examples/bpf/ack_recognize.c
>>
>> diff --git a/examples/bpf/ack_recognize.c b/examples/bpf/ack_recognize.c
>> new file mode 100644
>> index 0000000..5da620c
>> --- /dev/null
>> +++ b/examples/bpf/ack_recognize.c
>> @@ -0,0 +1,98 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright 2017 Google Inc.
>
> 2017?? it's 2018. Did you write this last year and just now posting?

November 2017? Shortly prior to you taking iproute2-next off of Stephen's hands.

It was the first stage of a proof of concept showing (eventually) that
a simple ack decimator/filter (like "drop every other ack", or simple
"drop head" in the supplied example) had a ton of problems, and that
to filter out acks correctly to any extent, it needed to peer back
into the queue (see the sch_cake ack-filter controversy ongoing on
this list).

While it's better than what is in wondershaper, I wouldn't recommend
anyone use it. It was, however, useful in acquiring a gut feel for
what several other broken ack filters might be doing in the field. I
submitted it as a possibly useful example for showing off tc-bpf and
to add fuel to the ack-filter fire under cake.

I can clean it up, SPF it, etc, if you want it. Otherwise, sorry for the noise.

>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * as published by the Free Software Foundation; either version 2
>> + * of the License, or (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
>> + * 02110-1301, USA.
>> + */
>> +
>> +/*
>> + * Author: dave.t...@gmail.com (Dave Taht)
>> + *
>> + * ack_recognizer: An eBPF program that correctly recognizes modern TCP 
>> ACKs,
>> + * with tcp option fields like SACK and timestamps, and no additional data.
>> + *
>> + * ack_match call: Recognize "pure acks" with no data payload
>> + *
>> + */
>> +
>> +#include "bpf_api.h"
>> +#include "linux/if_ether.h"
>> +#include "linux/ip.h"
>> +#include "linux/in.h"
>> +#include "linux/ipv6.h"
>> +#include "linux/tcp.h"
>> +
>> +/*
>> + * A pure ack contains the ip header, the tcp header + options, flags with 
>> the
>> + * ack field set, and no additional payload. That last bit is what every 
>> prior
>> + * ack filter gets wrong, they typically assume an obsolete 64 bytes, and 
>> don't
>> + * calculate the options (like sack or timestamps) to subtract from the 
>> payload.
>> + */
>> +
>> +__section_cls_entry
>> +int ack_match(struct __sk_buff *skb)
>> +{
>> + void *data = (void *)(long)skb->data;
>> + void *data_end = (void *)(long)skb->data_end;
>> + struct ethhdr *eth = data;
>> + struct iphdr *iph = data + sizeof(*eth);
>> + struct tcphdr *tcp;
>> +
>> + if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*tcp) > data_end)
>> + return 0;
>> +
>> + if (eth->h_proto == htons(ETH_P_IP) &&
>> + iph->version == 4) {
>
> Why not make the version == 4 under the proto check?
> if (eth->h_proto == htons(ETH_P_IP) {
> if (iph->version != 4)
> return 0;
> if the eth proto is ETH_P_IP, and version != 4 something is really wrong
>
>> + if(iph->protocol == IPPROTO_TCP &&
>> +iph->ihl == 5 &&
>> +data + sizeof(*eth) + 20 + sizeof(*tcp) <= data_end) {
>> + tcp = data + sizeof(*eth) + 20;
>> + if (tcp->ack &&
>> + htons(iph->tot_len) == 20 + t

Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc

2018-05-01 Thread Dave Taht
On Tue, May 1, 2018 at 11:53 AM, Toke Høiland-Jørgensen  wrote:
> Eric Dumazet  writes:
>
>> On 04/30/2018 02:27 PM, Dave Taht wrote:
>>
>>> I actually have a tc-bpf based ack filter, written during the
>>> development of cake's ack-thinner, that I should submit one of these
>>> days. It proved to be of limited use.
>>>
>>> Probably the biggest mistake we made was calling this cake feature a
>>> filter. It isn't.
>>>
>>> Maybe we should have called it a "thinner" or something like that? In
>>> order to properly "thin" or "reduce" an ack stream
>>> you have to have a queue to look at and some related state. TC filters
>>> do not operate on queues, qdiscs do. Thus the "ack-filter" here is
>>> deeply embedded into cake's flow isolation and queue structures.
>>
>> A feature eating packets _is_ a filter.
>>
>> Given that a qdisc only sees one direction, I really do not get why
>> ack-filter is so desirable in a packet scheduler.
>
> The ACK filter in CAKE is there to solve a very particular use case:
> Residential internet connections with bandwidths so asymmetric that it
> hurts TCP performance. It is not on by default; but rather meant to be
> configured by users which suffer from this particular ISP brokenness
> (which, sadly, does happen; there are ISPs who believe a 50/1 bandwidth
> ratio is reasonable). For those users, the ACK filter can help.
>
> We certainly do not advise people to turn it on in the general case! As
> you point, in general TCP performance is best improved in the TCP stack...
>
>> You have not provided any numbers to show how useful it is to maintain
>> this code
>
> You mean apart from that in the linked blog post and the paper?
> Here's an executive summary:
>
> On a 30/1 Mbps connection with a bidirectional traffic test (RRUL),
> turning on the ACK filter improves downstream throughput by ~20% and
> upstream throughput by ~12% in conservative mode and ~40% in aggressive
> mode, at the cost of ~5ms of inter-flow latency due to the increased
> congestion.

On a simulated 50/1 comcast connection, I got double the throughput
on a similar test, with no obvious glitches in the tcp flow, in the early stages
of development.

http://blog.cerowrt.org/post/ack_filtering/

I was very, very dubious about the value of ack thinning until that point,
but it was hard to argue with the data.

> In *really* pathological cases, the effect can be a lot more; for
> instance, the ACK filter increases the achievable downstream throughput
> on a link with 100 Kbps in the upstream direction by an order of
> magnitude (from ~2.5 Mbps to ~25 Mbps).
>
>> (probably still broken btw, considering it is changing some skb
>> attributes).
>
> As you may have noticed over the last few iterations, I've actually been
> trying to fix any brokenness. If you could be specific as to what is
> still broken, that would be helpful.
>
> (I'm assuming are referring to the calls to skb_set_inner* - but do you
> mean all of them, or just the ones that set inner == outer?)
>
>> On wifi (or any half duplex medium), you might gain something by
>> carefully sending ACK not too often, but ultimately this should be
>> done by TCP stack, in well controlled environment [1], instead of
>> middle-boxes happily playing/breaking some Internet standards.
>>
>> [1] TCP stack has the estimations of RTT, RWIN, throughput, and might
>> be able to avoid flooding the network with acks under some conditions.
>
> You are quite right that in general, TCP performance is best improved in
> the TCP stack. But home users are not generally in control of that; the
> ACK filter helps in those specific deployments (again, it's off by
> default, and you shouldn't turn it on in the general case!).
>
>> Say RTT is 100ms, and we receive 1 packet every 100 usec (no GRO
>> aggregation). Maybe we do not really _need_ to send 5000 acks per second
>> (or even 10,000 acks per second if a hole needs a repair).
>
> Yes, please do fix that.

:) I really would like to see cake tested at 10GigE and above, and its
performance improved overall. I tend to think we need more queues than 1024
at 40GigE+, and we presently run out of cpu (even unshaped) long
before we hit that point.

>
>> Also on wifi, the queue builds in the driver queues anyway, not in the
>> qdisc.
>
> Actually, we've fixed that (for some drivers); there is now no qdisc on
> WiFi interfaces, and we've integrated FQ-CoDel'ed queueing into the
> stack where it can be effective. But no, running CAKE with an ACK filter
> on a WiFi link is not 

[PATCH iproute2-next] Add tc-BPF example for a TCP pure ack recognizer

2018-05-01 Thread Dave Taht
ack_recognize can shift pure tcp acks into another flowid.
---
 examples/bpf/ack_recognize.c | 98 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 examples/bpf/ack_recognize.c

diff --git a/examples/bpf/ack_recognize.c b/examples/bpf/ack_recognize.c
new file mode 100644
index 0000000..5da620c
--- /dev/null
+++ b/examples/bpf/ack_recognize.c
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2017 Google Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA.
+ */
+
+/*
+ * Author: dave.t...@gmail.com (Dave Taht)
+ *
+ * ack_recognizer: An eBPF program that correctly recognizes modern TCP ACKs,
+ * with tcp option fields like SACK and timestamps, and no additional data.
+ *
+ * ack_match call: Recognize "pure acks" with no data payload
+ *
+ */
+
+#include "bpf_api.h"
+#include "linux/if_ether.h"
+#include "linux/ip.h"
+#include "linux/in.h"
+#include "linux/ipv6.h"
+#include "linux/tcp.h"
+
+/*
+ * A pure ack contains the ip header, the tcp header + options, flags with the
+ * ack field set, and no additional payload. That last bit is what every prior
+ * ack filter gets wrong, they typically assume an obsolete 64 bytes, and don't
+ * calculate the options (like sack or timestamps) to subtract from the 
payload.
+ */
+
+__section_cls_entry
+int ack_match(struct __sk_buff *skb)
+{
+   void *data = (void *)(long)skb->data;
+   void *data_end = (void *)(long)skb->data_end;
+   struct ethhdr *eth = data;
+   struct iphdr *iph = data + sizeof(*eth);
+   struct tcphdr *tcp;
+
+   if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*tcp) > data_end)
+   return 0;
+
+   if (eth->h_proto == htons(ETH_P_IP) &&
+   iph->version == 4) {
+   if(iph->protocol == IPPROTO_TCP &&
+  iph->ihl == 5 &&
+  data + sizeof(*eth) + 20 + sizeof(*tcp) <= data_end) {
+   tcp = data + sizeof(*eth) + 20;
+   if (tcp->ack &&
+   htons(iph->tot_len) == 20 + tcp->doff*4)
+   return -1;
+   }
+   } else if (eth->h_proto == htons(ETH_P_IPV6) &&
+  iph->version == 6) {
+   struct ipv6hdr *iph6 = (struct ipv6hdr *) iph;
+   if(iph6->nexthdr == IPPROTO_TCP &&
+  data + sizeof(*eth) + 40 + sizeof(*tcp) <= data_end ) {
+   tcp = data + sizeof(*eth) + 40;
+   if (tcp->ack &&
+   tcp->doff*4 == htons(iph6->payload_len))
+   return -1;
+   }
+   }
+
+   return 0;
+}
+
+/* Example: Move acks into a priority queue:
+
+IFACE=eth0
+tc qdisc del dev $IFACE root 2> /dev/null
+tc qdisc add dev $IFACE root handle 1: prio bands 3 \
+   priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+tc qdisc add dev $IFACE parent 1:1 handle 10:1 sfq headdrop # acks only
+tc qdisc add dev $IFACE parent 1:2 handle 20:1 fq_codel # all other traffic
+tc qdisc add dev $IFACE parent 1:3 handle 30:1 fq_codel # unused
+tc filter add dev $IFACE parent 1: prio 1 bpf \
+   object-file ack_recognize.o flowid 1:1
+
+Please note that a strict priority queue is not a good idea (drr would be
+better), nor is doing any level of prioritization on acks at all
+*/
+
+BPF_LICENSE("GPL");
-- 
2.7.4
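
As a sanity check of the length arithmetic ack_match() relies on, a
hypothetical userspace test (not part of the patch) that a pure ack's
IPv4 total length covers exactly the IP header plus the TCP header with
options:

#include <arpa/inet.h>
#include <assert.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

int main(void)
{
	struct iphdr ip = {0};
	struct tcphdr th = {0};

	ip.ihl = 5;		/* 20-byte IPv4 header, no IP options */
	th.doff = 8;		/* 32-byte TCP header: SACK + timestamps */
	th.ack = 1;

	/* No payload: total length is exactly headers, so it's a pure ack */
	ip.tot_len = htons(20 + th.doff * 4);
	assert(ntohs(ip.tot_len) == 20 + th.doff * 4);

	/* One data byte: no longer a pure ack, matcher must pass it through */
	ip.tot_len = htons(20 + th.doff * 4 + 1);
	assert(ntohs(ip.tot_len) != 20 + th.doff * 4);
	return 0;
}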



Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc

2018-04-30 Thread Dave Taht
On Mon, Apr 30, 2018 at 2:21 PM, Cong Wang  wrote:
> On Sun, Apr 29, 2018 at 2:34 PM, Toke Høiland-Jørgensen  wrote:
>> sch_cake targets the home router use case and is intended to squeeze the
>> most bandwidth and latency out of even the slowest ISP links and routers,
>> while presenting an API simple enough that even an ISP can configure it.
>>
>> Example of use on a cable ISP uplink:
>>
>> tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
>>
>> To shape a cable download link (ifb and tc-mirred setup elided)
>>
>> tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
>>
>> CAKE is filled with:
>>
>> * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
>>   derived Flow Queuing system, which autoconfigures based on the bandwidth.
>> * A novel "triple-isolate" mode (the default) which balances per-host
>>   and per-flow FQ even through NAT.
>> * A deficit based shaper, that can also be used in an unlimited mode.
>> * 8 way set associative hashing to reduce flow collisions to a minimum.
>> * A reasonable interpretation of various diffserv latency/loss tradeoffs.
>> * Support for zeroing diffserv markings for entering and exiting traffic.
>> * Support for interacting well with Docsis 3.0 shaper framing.
>> * Extensive support for DSL framing types.
>> * Support for ack filtering.
>
> Why this TCP ACK filtering has to be built into CAKE qdisc rather than
> an independent TC filter? Why other qdisc's can't use it?

I actually have a tc-bpf based ack filter, written during the
development of cake's ack-thinner, that I should submit one of these
days. It proved to be of limited use.

Probably the biggest mistake we made was calling this cake feature a
filter. It isn't.

Maybe we should have called it a "thinner" or something like that? In
order to properly "thin" or "reduce" an ack stream
you have to have a queue to look at and some related state. TC filters
do not operate on queues, qdiscs do. Thus the "ack-filter" here is
deeply embedded into cake's flow isolation and queue structures.

>
>
>> * Extensive statistics for measuring, loss, ecn markings, latency
>>   variation.
>>
>> A paper describing the design of CAKE is available at
>> https://arxiv.org/abs/1804.07617
>>
>
> Thanks.



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


Re: [bug, bisected] pfifo_fast causes packet reordering

2018-03-13 Thread Dave Taht
On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher
 wrote:
> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are
> delivered out-of-order.
>
> We have tracked the problem down to the driver interface level, and it seems
> that the driver's net_device_ops.ndo_start_xmit() function gets the packets
> handed over in the wrong order.
>
> This behavior was not observed on Linux v4.15 and I have bisected the
> problem down to this patch:
>
>> commit c5ad119fb6c09b0297446be05bd66602fa564758
>> Author: John Fastabend 
>> Date:   Thu Dec 7 09:58:19 2017 -0800
>>
>>net: sched: pfifo_fast use skb_array
>>
>>This converts the pfifo_fast qdisc to use the skb_array data structure
>>and set the lockless qdisc bit. pfifo_fast is the first qdisc to
>> support
>>the lockless bit that can be a child of a qdisc requiring locking. So
>>we add logic to clear the lock bit on initialization in these cases
>> when
>>the qdisc graft operation occurs.
>>
>>This also removes the logic used to pick the next band to dequeue from
>>and instead just checks a per priority array for packets from top
>> priority
>>to lowest. This might need to be a bit more clever but seems to work
>>for now.
>>
>>Signed-off-by: John Fastabend 
>>Signed-off-by: David S. Miller 
>
>
> The patch does not revert cleanly, but moving to one commit earlier makes
> the problem go away.
>
> Selecting the "fq" scheduler instead of "pfifo_fast" makes the problem go
> away as well.

I am, of course, a fan of obsoleting pfifo_fast. There's no good reason
for it anymore.

>
> Is this an unintended side-effect of the patch or is there something the
> driver has to do to request in-order delivery?
>
> Thanks,
> Jakob



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)

2017-12-04 Thread Dave Taht
Jesper:

I have a tendency to deal with netdev by itself and never cross post
there, as the bufferbloat.net servers (primarily to combat spam)
mandate starttls and vger doesn't support it at all, thus leading to
raising davem blood pressure which I'd rather not do.

But moving on...

On Mon, Dec 4, 2017 at 2:56 AM, Jesper Dangaard Brouer
 wrote:
>
> On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht  wrote:
>
>> Changing the topic, adding bloat.
>
> Adding netdev, and also adjust the topic to be a rant on that the Linux
> kernel network stack is actually damn fast, and if you need something
> faster then XDP can solve your needs...
>
>> Joel Wirāmu Pauling  writes:
>>
>> > Just from a Telco/Industry perspective slant.
>> >
>> > Everything in DC has moved to SFP28 interfaces at 25Gbit as the server
>> > port of interconnect. Everything TOR wise is now QSFP28 - 100Gbit.
>> > Mellanox X5 cards are the current hotness, and their offload
>> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allows for
>> > OVS flow rules programming into the card. We have a lot of customers
>> > chomping at the bit for that feature (disclaimer I work for Nuage
>> > Networks, and we are working on enhanced OVS to do just that) for NFV
>> > workloads.
>>
>> What Jesper's been working on for ages has been to try and get linux's
>> PPS up for small packets, which last I heard was hovering at about
>> 4Gbits.
>
> I hope you made a typo here Dave, the normal Linux kernel is definitely
> way beyond 4Gbit/s, you must have misunderstood something, maybe you
> meant 40Gbit/s? (which is also too low)

The context here was PPS for *non-gro'd* tcp ack packets, in the
further context of the increasingly epic "benefits of ack filtering"
thread on the bloat list, where for 50x1 end-user asymmetry we were
seeing 90% fewer acks with the new sch_cake ack-filter code, and double
the throughput...

The kind of return traffic you see from data sent outside the DC, with
tons of flows.

What's that number?

>
> Scaling up to more CPUs and TCP-stream, Tariq[1] and I have showed the
> Linux kernel network stack scales to 94Gbit/s (linerate minus overhead).
> But when the driver's page-recycler fails, we hit bottlenecks in the
> page-allocator, that cause negative scaling to around 43Gbit/s.

So I divide 94 by 22 and get ~4Gbit for acks. Or I look at PPS * 66. Or?
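
(As a sanity check on that kind of arithmetic: a pure TCP ack is
roughly 66 bytes on the wire, so ack bitrate is just PPS * 66 * 8.
A throwaway sketch, using only the PPS figures already quoted in this
thread - the 66-byte wire size is my assumption:

#include <stdio.h>

int main(void)
{
	double pps[] = { 812e3, 2e6 }; /* Jesper's per-CPU figures above */
	int i;

	/* ack bitrate back-of-envelope: PPS * 66 bytes * 8 bits */
	for (i = 0; i < 2; i++)
		printf("%.0f pps of 66-byte acks = %.2f Gbit/s\n",
		       pps[i], pps[i] * 66 * 8 / 1e9);
	return 0;
}

which prints ~0.43 Gbit/s and ~1.06 Gbit/s respectively.)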

> [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418f...@mellanox.com
>
> Linux have for a _long_ time been doing 10Gbit/s TCP-stream easily, on
> a SINGLE CPU.  This is mostly thanks to TSO/GRO aggregating packets,
> but last couple of years the network stack have been optimized (with
> UDP workloads), and as a result we can do 10G without TSO/GRO on a
> single-CPU.  This is "only" 812Kpps with MTU size frames.

acks.

> It is important to NOTICE that I'm mostly talking about SINGLE-CPU
> performance.  But the Linux kernel scales very well to more CPUs, and
> you can scale this up, although we are starting to hit scalability
> issues in MM-land[1].
>
> I've also demonstrated that netdev-community have optimized the kernel's
> per-CPU processing power to around 2Mpps.  What does this really
> mean... well with MTU size packets 812Kpps was 10Gbit/s, thus 25Gbit/s
> should be around 2Mpps. That implies Linux can do 25Gbit/s on a
> single CPU without GRO (MTU size frames).  Do you need more I ask?

The benchmark I had in mind was, say, 100k flows going out over the internet,
and the characteristics of the ack flows on the return path.

>
>
>> The route table lookup also really expensive on the main cpu.

To clarify the context here, I was asking specifically if the X5 Mellanox card
did routing table offload or only switching.

> Well, it used-to-be very expensive. Vincent Bernat wrote some excellent
> blogposts[2][3] on the recent improvements over kernel versions, and
> gave due credit to people involved.
>
> [2] 
> https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
> [3] 
> https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux
>
> He measured around 25 to 35 nanosec cost of route lookups.  My own
> recent measurements were 36.9 ns cost of fib_table_lookup.

On Intel hw.
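
(Rough arithmetic, mine not Jesper's, tying his two numbers together:
at the ~2Mpps per-CPU figure above, the whole per-packet budget is
500ns, so a ~37ns fib_table_lookup is on the order of 7-8% of it:

#include <stdio.h>

int main(void)
{
	double pps = 2e6;             /* per-CPU pps quoted above */
	double fib_ns = 36.9;         /* measured fib_table_lookup cost */
	double budget_ns = 1e9 / pps; /* = 500 ns per packet */

	printf("budget %.0f ns, fib lookup %.1f ns (%.1f%% of budget)\n",
	       budget_ns, fib_ns, 100.0 * fib_ns / budget_ns);
	return 0;
}

Cheap relative to the budget on that hardware, which is the point of
asking what it costs elsewhere.)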

>
>> Does this stuff offload the route table lookup also?
>
> If you have not heard, the netdev-community have worked on something
> called XDP (eXpress Data Path).  This is a new layer in the network
> stack, that basically operates at the same "layer"/level as DPDK.
> Thus, surprise we get the same performance numbers as DPDK. E.g. I can
> do 13.4 Mpps forwar

[PATCH iproute2 net-next] Add support for cake qdisc

2017-12-03 Thread Dave Taht
sch_cake is intended to squeeze the most bandwidth and latency out of even
the slowest ISP links and routers, while presenting an API simple enough
that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Support for DSL framing types and shapers.
* (New) Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency variation.

There are some features still considered experimental, notably the
ingress_autorate bandwidth estimator and cobalt itself.

Various versions have been baking as an out-of-tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.
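
To illustrate the 8-way set-associative hashing bullet above (a
hypothetical sketch of the idea only, NOT the actual sch_cake code):
the flow hash selects a set of 8 ways, a way already holding the flow
or an empty way is preferred, and a true collision only happens when
all 8 ways of the set are busy with other flows.

#define WAYS 8
#define SETS 128	/* e.g. 1024 queues / 8 ways; illustrative only */

struct way { unsigned int tag; int busy; };
static struct way table[SETS][WAYS];

static int flow_lookup(unsigned int hash)
{
	unsigned int set = hash % SETS, tag = hash / SETS;
	int i, empty = -1;

	for (i = 0; i < WAYS; i++) {
		if (table[set][i].busy && table[set][i].tag == tag)
			return set * WAYS + i;	/* same flow, direct hit */
		if (!table[set][i].busy && empty < 0)
			empty = i;		/* remember first free way */
	}
	if (empty >= 0) {			/* new flow takes a free way */
		table[set][empty].tag = tag;
		table[set][empty].busy = 1;
		return set * WAYS + empty;
	}
	return set * WAYS;			/* all 8 ways busy: collide */
}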

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the c...@lists.bufferbloat.net mailing list.

Signed-off-by: Dave Taht 
---
 man/man8/tc-cake.8 | 678 ++
 tc/Makefile|   1 +
 tc/q_cake.c| 771 +
 3 files changed, 1450 insertions(+)
 create mode 100644 man/man8/tc-cake.8
 create mode 100644 tc/q_cake.c

diff --git a/man/man8/tc-cake.8 b/man/man8/tc-cake.8
new file mode 100644
index 000..788e046
--- /dev/null
+++ b/man/man8/tc-cake.8
@@ -0,0 +1,678 @@
+.TH CAKE 8 "23 November 2017" "iproute2" "Linux"
+.SH NAME
+CAKE \- Common Applications Kept Enhanced (CAKE)
+.SH SYNOPSIS
+.B tc qdisc ... cake
+.br
+[
+.BR bandwidth
+RATE |
+.BR unlimited*
+|
+.BR autorate_ingress
+]
+.br
+[
+.BR rtt
+TIME |
+.BR datacentre
+|
+.BR lan
+|
+.BR metro
+|
+.BR regional
+|
+.BR internet*
+|
+.BR oceanic
+|
+.BR satellite
+|
+.BR interplanetary
+]
+.br
+[
+.BR besteffort
+|
+.BR diffserv8
+|
+.BR diffserv4
+|
+.BR diffserv-llt
+|
+.BR diffserv3*
+]
+.br
+[
+.BR flowblind
+|
+.BR srchost
+|
+.BR dsthost
+|
+.BR hosts
+|
+.BR flows
+|
+.BR dual-srchost
+|
+.BR dual-dsthost
+|
+.BR triple-isolate*
+]
+.br
+[
+.BR nat
+|
+.BR nonat*
+]
+.br
+[
+.BR wash
+|
+.BR nowash*
+]
+.br
+[
+.BR ack-filter
+|
+.BR ack-filter-aggressive
+|
+.BR no-ack-filter*
+]
+.br
+[
+.BR memlimit
+LIMIT ]
+.br
+[
+.BR ptm
+|
+.BR atm
+|
+.BR noatm*
+]
+.br
+[
+.BR overhead
+N |
+.BR conservative
+|
+.BR raw*
+]
+.br
+[
+.BR mpu
+N ]
+.br
+[
+.BR ingress
+|
+.BR egress*
+]
+.br
+(* marks defaults)
+
+
+.SH DESCRIPTION
+CAKE (Common Applications Kept Enhanced) is a shaping-capable queue discipline
+which uses both AQM and FQ.  It combines COBALT, which is an AQM algorithm
+combining Codel and BLUE, a shaper which operates in deficit mode, and a variant
+of DRR++ for flow isolation.  8-way set-associative hashing is used to virtually
+eliminate hash collisions.  Priority queuing is available through a simplified
+diffserv implementation.  Overhead compensation for various encapsulation
+schemes is tightly integrated.
+
+All settings are optional; the default settings are chosen to be sensible in
+most common deployments.  Most people will only need to set the
+.B bandwidth
+parameter to get useful results, but reading the
+.B Overhead Compensation
+and
+.B Round Trip Time
+sections is strongly encouraged.
+
+.SH SHAPER PARAMETERS
+CAKE uses a deficit-mode shaper, which does not exhibit the initial burst
+typical of token-bucket shapers.  It will automatically burst precisely as much
+as required to maintain the configured throughput.  As such, it is very
+straightforward to configure.
+.PP
+.B unlimited
+(default)
+.br
+   No limit on the bandwidth.
+.PP
+.B bandwidth
+RATE
+.br
+   Set the shaper bandwidth.  See
+.BR tc(8)
+or examples below for details of the RATE value.
+.PP
+.B autorate_ingress
+.br

[PATCH iproute2 net-next] Add support and man page for cake qdisc

2017-12-03 Thread Dave Taht
Add support and man page for cake qdisc.

Dave Taht (1):
  Add support for cake qdisc

 man/man8/tc-cake.8 | 678 ++
 tc/Makefile|   1 +
 tc/q_cake.c| 771 +
 3 files changed, 1450 insertions(+)
 create mode 100644 man/man8/tc-cake.8
 create mode 100644 tc/q_cake.c

-- 
2.7.4



[PATCH net-next 1/3] pkt_sched.h: add support for sch_cake API

2017-12-03 Thread Dave Taht
Signed-off-by: Dave Taht 
---
 include/uapi/linux/pkt_sched.h | 58 ++
 1 file changed, 58 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index af3cc2f..ed7c111 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -935,4 +935,62 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+/* CAKE */
+enum {
+   TCA_CAKE_UNSPEC,
+   TCA_CAKE_BASE_RATE,
+   TCA_CAKE_DIFFSERV_MODE,
+   TCA_CAKE_ATM,
+   TCA_CAKE_FLOW_MODE,
+   TCA_CAKE_OVERHEAD,
+   TCA_CAKE_RTT,
+   TCA_CAKE_TARGET,
+   TCA_CAKE_AUTORATE,
+   TCA_CAKE_MEMORY,
+   TCA_CAKE_NAT,
+   TCA_CAKE_ETHERNET,
+   TCA_CAKE_WASH,
+   TCA_CAKE_MPU,
+   TCA_CAKE_INGRESS,
+   TCA_CAKE_ACK_FILTER,
+   __TCA_CAKE_MAX
+};
+#define TCA_CAKE_MAX   (__TCA_CAKE_MAX - 1)
+
+struct tc_cake_traffic_stats {
+   __u32 packets;
+   __u32 link_ms;
+   __u64 bytes;
+};
+
+#define TC_CAKE_MAX_TINS (8)
+struct tc_cake_xstats {
+   __u16 version;  /* == 5, increments when struct extended */
+   __u8  max_tins; /* == TC_CAKE_MAX_TINS */
+   __u8  tin_cnt;  /* <= TC_CAKE_MAX_TINS */
+
+   __u32 threshold_rate[TC_CAKE_MAX_TINS];
+   __u32 target_us[TC_CAKE_MAX_TINS];
+   struct tc_cake_traffic_stats sent[TC_CAKE_MAX_TINS];
+   struct tc_cake_traffic_stats dropped[TC_CAKE_MAX_TINS];
+   struct tc_cake_traffic_stats ecn_marked[TC_CAKE_MAX_TINS];
+   struct tc_cake_traffic_stats backlog[TC_CAKE_MAX_TINS];
+   __u32 interval_us[TC_CAKE_MAX_TINS];
+   __u32 way_indirect_hits[TC_CAKE_MAX_TINS];
+   __u32 way_misses[TC_CAKE_MAX_TINS];
+   __u32 way_collisions[TC_CAKE_MAX_TINS];
+   __u32 peak_delay_us[TC_CAKE_MAX_TINS]; /* ~= bulk flow delay */
+   __u32 avge_delay_us[TC_CAKE_MAX_TINS];
+   __u32 base_delay_us[TC_CAKE_MAX_TINS]; /* ~= sparse flows delay */
+   __u16 sparse_flows[TC_CAKE_MAX_TINS];
+   __u16 bulk_flows[TC_CAKE_MAX_TINS];
+   __u16 unresponse_flows[TC_CAKE_MAX_TINS]; /* v4 - was u32 last_len */
+   __u16 spare[TC_CAKE_MAX_TINS]; /* v4 - split last_len */
+   __u32 max_skblen[TC_CAKE_MAX_TINS];
+   __u32 capacity_estimate;  /* version 2 */
+   __u32 memory_limit;   /* version 3 */
+   __u32 memory_used;/* version 3 */
+   struct tc_cake_traffic_stats ack_drops[TC_CAKE_MAX_TINS]; /* v5 */
+};
+
 #endif
-- 
2.7.4



[PATCH net-next 0/3] Add Common Applications Kept Enhanced (cake) qdisc

2017-12-03 Thread Dave Taht
sch_cake is intended to squeeze the most bandwidth and latency out of even
the slowest ISP links and routers, while presenting an API simple enough
that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Support for DSL framing types and shapers.
* (New) Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency variation.

There are some features still considered experimental, notably the
ingress_autorate bandwidth estimator and cobalt itself.

Various versions have been baking as an out-of-tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the c...@lists.bufferbloat.net mailing list.

Dave Taht (3):
  pkt_sched.h: add support for sch_cake API
  Add Common Applications Kept Enhanced (cake) qdisc
  Add support for building the new cake qdisc

 include/net/cobalt.h   |  152 +++
 include/uapi/linux/pkt_sched.h |   58 +
 net/sched/Kconfig  |   11 +
 net/sched/Makefile |1 +
 net/sched/sch_cake.c   | 2561 
 5 files changed, 2783 insertions(+)
 create mode 100644 include/net/cobalt.h
 create mode 100644 net/sched/sch_cake.c

-- 
2.7.4



[PATCH net-next 3/3] Add support for building the new cake qdisc

2017-12-03 Thread Dave Taht
Hook up sch_cake to the build system.

Signed-off-by: Dave Taht 
---
 net/sched/Kconfig  | 11 +++
 net/sched/Makefile |  1 +
 2 files changed, 12 insertions(+)

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index c03d86a..3ea22e5 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -284,6 +284,17 @@ config NET_SCH_FQ_CODEL
 
  If unsure, say N.
 
+config NET_SCH_CAKE
+   tristate "Common Applicatons Kept Enhanced (CAKE)"
+   help
+ Say Y here if you want to use the CAKE
+ packet scheduling algorithm.
+
+ To compile this driver as a module, choose M here: the module
+ will be called sch_cake.
+
+ If unsure, say N.
+
 config NET_SCH_FQ
tristate "Fair Queue"
help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5b63544..b8dd962 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_NET_SCH_CHOKE)   += sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)  += sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)+= sch_codel.o
 obj-$(CONFIG_NET_SCH_FQ_CODEL) += sch_fq_codel.o
+obj-$(CONFIG_NET_SCH_CAKE) += sch_cake.o
 obj-$(CONFIG_NET_SCH_FQ)   += sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)  += sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)  += sch_pie.o
-- 
2.7.4



[PATCH net-next 2/3] Add Common Applications Kept Enhanced (cake) qdisc

2017-12-03 Thread Dave Taht
sch_cake is intended to squeeze the most bandwidth and latency out of even
the slowest ISP links and routers, while presenting an API simple enough
that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* (New) Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency variation.

There are some features still considered experimental, notably the
ingress_autorate bandwidth estimator and cobalt itself.

Various versions have been baking as an out-of-tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the c...@lists.bufferbloat.net mailing list.

tc -s qdisc show dev eth0
qdisc cake 1: dev eth0 root refcnt 2 bandwidth 20Mbit diffserv3
 triple-isolate nat ack-filter rtt 100.0ms noatm overhead 18 via-ethernet
 total_overhead 18 hard_header_len 14 mpu 64 
 Sent 119336682 bytes 390608 pkt (dropped 126463, overlimits 882857 requeues 0) 
 backlog 23620b 29p requeues 0 
 memory used: 158598b of 4Mb
 capacity estimate: 20Mbit
 Bulk   Best Effort  Voice
  thresh  1250Kbit  20Mbit   5Mbit
  target14.5ms   5.0ms   5.0ms
  interval 109.5ms 100.0ms 100.0ms
  pk_delay   6.8ms  12.8ms   9.8ms
  av_delay   2.5ms   2.6ms   2.6ms
  sp_delay   538us   256us   350us
  pkts   80534  339655   96911
  bytes   12165915  90314487  31167998
  way_inds   0   0   0
  way_miss   6  79   7
  way_cols   0   0   0
  drops   1161    2303     480
  marks  0   0   0
  ack_drop   45420   69090    8009
  sp_flows   2   5   2
  bk_flows   4  16   3
  un_flows   0   0   0
  max_len 3028    3028    3028

Signed-off-by: Dave Taht 
Tested-by: Pete Heist 
Tested-by: Georgios Amanakis  

---
 include/net/cobalt.h |  152 +++
 net/sched/sch_cake.c | 2561 ++
 2 files changed, 2713 insertions(+)
 create mode 100644 include/net/cobalt.h
 create mode 100644 net/sched/sch_cake.c

diff --git a/include/net/cobalt.h b/include/net/cobalt.h
new file mode 100644
index 000..b2559df
--- /dev/null
+++ b/include/net/cobalt.h
@@ -0,0 +1,152 @@
+#ifndef __NET_SCHED_COBALT_H
+#define __NET_SCHED_COBALT_H
+
+/* COBALT - Codel-BLUE Alternate AQM algorithm.
+ *
+ *  Copyright (C) 2011-2012 Kathleen Nichols 
+ *  Copyright (C) 2011-2012 Van Jacobson 
+ *  Copyright (C) 2012 Eric Dumazet 
+ *  Copyright (C) 2016-2017 Michael D. Täht 
+ *  Copyright (c) 2015-2017 Jonathan Morton 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions, and the following disclaimer,
+ *without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. The names of the authors may not be used to endorse or promote products
+ *derived from this software without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full,

Re: [PATCH net-next v2 0/2] netem: fix compilation on 32 bit

2017-11-14 Thread Dave Taht
On Tue, Nov 14, 2017 at 11:40 AM, Randy Dunlap  wrote:
> On 11/14/2017 11:27 AM, Stephen Hemminger wrote:
>> A couple of places where 64 bit CPU was being assumed incorrectly.
>>
>> Stephen Hemminger (2):
>>   netem: fix 64 bit divide
>>   netem: remove unnecessary 64 bit modulus
>>
>>  net/sched/sch_netem.c | 17 +++--
>>  1 file changed, 7 insertions(+), 10 deletions(-)
>>
>
> Acked-by: Randy Dunlap 

Acked-by: Dave Taht 

Thx.


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Dave Taht
I have been thinking we'd try to submit sch_cake to mainline on this
go-around. It's been out of tree for way too long.

I look forward to understanding your patches soon in the tbf case.

(I'm only responding because cake uses a deficit-based scheduler rather
than a token bucket, is not reliant on the tc filter infrastructure
for its queuing, and I'd love to have it handle multiple cpus better.)

repo: https://github.com/dtaht/sch_cake.git

doc: https://www.bufferbloat.net/projects/codel/wiki/CakeTechnical/
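
For contrast, the textbook deficit round robin core looks something
like this (a sketch of the general technique, not cake's actual code):
each flow earns a fixed quantum per round and sends while its deficit
covers the head packet, so there are no token refill timers or rate
state at all.

struct drr_flow {
	int deficit;
	int quantum;	/* typically one MTU */
};

/* Serve one flow for one round.  send_head() transmits the current
 * head packet and returns the length of the new head, or 0 when the
 * flow's queue has drained.
 */
static int drr_service(struct drr_flow *f, int head_len,
		       int (*send_head)(void))
{
	int sent = 0;

	f->deficit += f->quantum;
	while (head_len > 0 && f->deficit >= head_len) {
		f->deficit -= head_len;
		sent += head_len;
		head_len = send_head();
	}
	if (head_len == 0)
		f->deficit = 0;	/* empty flows don't bank credit */
	return sent;
}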


Re: [PATCH v3 net-next 2/3] netem: add uapi to express delay and jitter in nanoseconds

2017-11-08 Thread Dave Taht
On Wed, Nov 8, 2017 at 3:24 PM, Stephen Hemminger
 wrote:
> On Wed,  8 Nov 2017 15:12:27 -0800
> Dave Taht  wrote:
>
>> --- a/net/sched/sch_netem.c
>> +++ b/net/sched/sch_netem.c
>> @@ -819,6 +819,8 @@ static const struct nla_policy netem_policy[TCA_NETEM_MAX + 1] = {
>>   [TCA_NETEM_LOSS]= { .type = NLA_NESTED },
>>   [TCA_NETEM_ECN] = { .type = NLA_U32 },
>>   [TCA_NETEM_RATE64]  = { .type = NLA_U64 },
>> + [TCA_NETEM_LATENCY64]   = { .type = NLA_S64 },
>> + [TCA_NETEM_JITTER64]= { .type = NLA_S64 },
>>  };
>>
>>  static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
>> @@ -916,6 +918,12 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt)
>>   q->rate = max_t(u64, q->rate,
>>   nla_get_u64(tb[TCA_NETEM_RATE64]));
>>
>> + if (tb[TCA_NETEM_LATENCY64])
>> + q->latency = nla_get_s64(tb[TCA_NETEM_LATENCY64]);
>> +
>> + if (tb[TCA_NETEM_JITTER64])
>> + q->jitter = nla_get_s64(tb[TCA_NETEM_JITTER64]);
>> +
>>   if (tb[TCA_NETEM_ECN])
>>   q->ecn = nla_get_u32(tb[TCA_NETEM_ECN]);
>>
>
> Although some of the maths use signed 64 bit.
> I think the API should be unsigned 64 bit.  Or do you want to allow
> negative latency?

Personally I find things simpler to reason about when signed, and the
userspace side of the code (currently) offers the ability to generically
have signed time values for "other stuff".

The constrained range of 63 vs 64 bits we can debate in 272 years or so.

I'll let eric cast the tie vote.

-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


[PATCH v3 net-next 2/3] netem: add uapi to express delay and jitter in nanoseconds

2017-11-08 Thread Dave Taht
netem userspace has long relied on a horrible /proc/net/psched hack
to translate the current notion of "ticks" to nanoseconds.

Expressing latency and jitter instead, in well defined nanoseconds,
increases the dynamic range of emulated delays and jitter in netem.

It will also ease a transition where reducing a tick to nsec
equivalence would constrain the max delay in prior versions of
netem to only 4.3 seconds.

Signed-off-by: Dave Taht 
Suggested-by: Eric Dumazet 
Reviewed-by: Eric Dumazet 
---
 include/uapi/linux/pkt_sched.h |  2 ++
 net/sched/sch_netem.c  | 14 ++
 2 files changed, 16 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 6a2c5ea..8fe6d18 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -537,6 +537,8 @@ enum {
TCA_NETEM_ECN,
TCA_NETEM_RATE64,
TCA_NETEM_PAD,
+   TCA_NETEM_LATENCY64,
+   TCA_NETEM_JITTER64,
__TCA_NETEM_MAX,
 };
 
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index e64e0e0..47d6dec 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -819,6 +819,8 @@ static const struct nla_policy netem_policy[TCA_NETEM_MAX + 1] = {
[TCA_NETEM_LOSS]= { .type = NLA_NESTED },
[TCA_NETEM_ECN] = { .type = NLA_U32 },
[TCA_NETEM_RATE64]  = { .type = NLA_U64 },
+   [TCA_NETEM_LATENCY64]   = { .type = NLA_S64 },
+   [TCA_NETEM_JITTER64]= { .type = NLA_S64 },
 };
 
 static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
@@ -916,6 +918,12 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt)
q->rate = max_t(u64, q->rate,
nla_get_u64(tb[TCA_NETEM_RATE64]));
 
+   if (tb[TCA_NETEM_LATENCY64])
+   q->latency = nla_get_s64(tb[TCA_NETEM_LATENCY64]);
+
+   if (tb[TCA_NETEM_JITTER64])
+   q->jitter = nla_get_s64(tb[TCA_NETEM_JITTER64]);
+
if (tb[TCA_NETEM_ECN])
q->ecn = nla_get_u32(tb[TCA_NETEM_ECN]);
 
@@ -1020,6 +1028,12 @@ static int netem_dump(struct Qdisc *sch, struct sk_buff *skb)
if (nla_put(skb, TCA_OPTIONS, sizeof(qopt), &qopt))
goto nla_put_failure;
 
+   if (nla_put(skb, TCA_NETEM_LATENCY64, sizeof(q->latency), &q->latency))
+   goto nla_put_failure;
+
+   if (nla_put(skb, TCA_NETEM_JITTER64, sizeof(q->jitter), &q->jitter))
+   goto nla_put_failure;
+
cor.delay_corr = q->delay_cor.rho;
cor.loss_corr = q->loss_cor.rho;
cor.dup_corr = q->dup_cor.rho;
-- 
2.7.4



[PATCH v3 net-next 0/3] netem: add nsec scheduling and slot feature

2017-11-08 Thread Dave Taht
This patch series converts netem away from the old "ticks" interface and
userspace API, and adds support for a new "slot" feature intended to
emulate bursty MACs such as WiFi and LTE better.

Changes since v2:
Use u64 for packet_len_sched_time()
Use simpler max(time_to_send,q->slot.slot_next)

Changes since v1:
Always pass new nanosecond APIs to userspace

Dave Taht (3):
  netem: convert to qdisc_watchdog_schedule_ns
  netem: add uapi to express delay and jitter in nanoseconds
  netem: support delivering packets in delayed time slots

 include/uapi/linux/pkt_sched.h |  10 +++
 net/sched/sch_netem.c  | 140 -
 2 files changed, 121 insertions(+), 29 deletions(-)

-- 
2.7.4



[PATCH v3 net-next 3/3] netem: support delivering packets in delayed time slots

2017-11-08 Thread Dave Taht
Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.

It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.

The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.

The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.
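
For reference, the per-slot delay ends up uniform in [MIN_DELAY,
MAX_DELAY]; get_slot_next() in the patch below draws it from a single
32-bit random value with a multiply-and-shift, which scales the random
word onto the range without a 64-bit division. A minimal userspace
restatement of that arithmetic (illustration only, not the patch):

#include <stdint.h>

/* rnd is assumed uniform over [0, 2^32) */
static uint64_t slot_delay(uint32_t rnd, uint64_t min_ns, uint64_t max_ns)
{
	return min_ns + (((uint64_t)rnd * (max_ns - min_ns)) >> 32);
}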

Examples of use:

tc qdisc add dev eth0 root netem delay 200us \
 slot 800us 10ms bytes 64k packets 42

A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:

tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
 limit 1
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
 slot 800us 10ms bytes 64k packets 42 limit 512

Signed-off-by: Dave Taht 
---
 include/uapi/linux/pkt_sched.h |  8 +
 net/sched/sch_netem.c  | 74 --
 2 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 8fe6d18..af3cc2f 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -539,6 +539,7 @@ enum {
TCA_NETEM_PAD,
TCA_NETEM_LATENCY64,
TCA_NETEM_JITTER64,
+   TCA_NETEM_SLOT,
__TCA_NETEM_MAX,
 };
 
@@ -576,6 +577,13 @@ struct tc_netem_rate {
__s32   cell_overhead;
 };
 
+struct tc_netem_slot {
+   __s64   min_delay; /* nsec */
+   __s64   max_delay;
+   __s32   max_packets;
+   __s32   max_bytes;
+};
+
 enum {
NETEM_LOSS_UNSPEC,
NETEM_LOSS_GI,  /* General Intuitive - 4 state model */
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 47d6dec..b686e75 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -135,6 +135,13 @@ struct netem_sched_data {
u32 a5; /* p23 used only in 4-states */
} clg;
 
+   struct tc_netem_slot slot_config;
+   struct slotstate {
+   u64 slot_next;
+   s32 packets_left;
+   s32 bytes_left;
+   } slot;
+
 };
 
 /* Time stamp put into socket buffer control block
@@ -591,6 +598,20 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
return NET_XMIT_SUCCESS;
 }
 
+/* Delay the next round with a new future slot with a
+ * correct number of bytes and packets.
+ */
+
+static void get_slot_next(struct netem_sched_data *q, u64 now)
+{
+   q->slot.slot_next = now + q->slot_config.min_delay +
+   (prandom_u32() *
+   (q->slot_config.max_delay -
+   q->slot_config.min_delay) >> 32);
+   q->slot.packets_left = q->slot_config.max_packets;
+   q->slot.bytes_left = q->slot_config.max_bytes;
+}
+
 static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
@@ -608,14 +629,17 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
p = rb_first(&q->t_root);
if (p) {
u64 time_to_send;
+   u64 now = ktime_get_ns();
 
skb = rb_to_skb(p);
 
/* if more time remaining? */
time_to_send = netem_skb_cb(skb)->time_to_send;
-   if (time_to_send <= ktime_get_ns()) {
-   rb_erase(p, &q->t_root);
+   if (q->slot.slot_next && q->slot.slot_next < time_to_send)
+   get_slot_next(q, now);
 
+   if (time_to_send <= now &&  q->slot.slot_next <= now) {
+   rb_erase(p, &q->t_root);
sch->q.qlen--;
qdisc_qstats_backlog_dec(sch, skb);
skb->next = NULL;
@@ -634,6 +658,14 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
skb->tstamp = 0;
 #endif
 
+   if (q->slot.slot_next) {
+   q->slot.packets_left--;
+   q->slot.bytes_left -= qdisc_pkt_len(skb);
+   if (q->slot.packets_left <= 0 ||
+   q->slot.bytes_left <= 0)
+   get_slot_next(q, now);
+   }
+
if (q->qdisc) {
unsigned int pkt_len = qdisc_pkt_len(skb);
struct sk_buff *to_free = NULL;

[PATCH v3 net-next 1/3] netem: convert to qdisc_watchdog_schedule_ns

2017-11-08 Thread Dave Taht
Upgrade the internal netem scheduler to use nanoseconds rather than
ticks throughout.

Convert to and from the std "ticks" userspace api automatically,
while allowing for finer grained scheduling to take place.

Signed-off-by: Dave Taht 
---
 net/sched/sch_netem.c | 56 +--
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index db0228a..e64e0e0 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,8 +77,8 @@ struct netem_sched_data {
 
struct qdisc_watchdog watchdog;
 
-   psched_tdiff_t latency;
-   psched_tdiff_t jitter;
+   s64 latency;
+   s64 jitter;
 
u32 loss;
u32 ecn;
@@ -145,7 +145,7 @@ struct netem_sched_data {
  * we save skb->tstamp value in skb->cb[] before destroying it.
  */
 struct netem_skb_cb {
-   psched_time_t   time_to_send;
+   u64 time_to_send;
 };
 
 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
@@ -305,11 +305,11 @@ static bool loss_event(struct netem_sched_data *q)
  * std deviation sigma.  Uses table lookup to approximate the desired
  * distribution, and a uniformly-distributed pseudo-random source.
  */
-static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
-   struct crndstate *state,
-   const struct disttable *dist)
+static s64 tabledist(s64 mu, s64 sigma,
+struct crndstate *state,
+const struct disttable *dist)
 {
-   psched_tdiff_t x;
+   s64 x;
long t;
u32 rnd;
 
@@ -332,10 +332,10 @@ static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
return  x / NETEM_DIST_SCALE + (sigma / NETEM_DIST_SCALE) * t + mu;
 }
 
-static psched_time_t packet_len_2_sched_time(unsigned int len, struct netem_sched_data *q)
+static u64 packet_len_2_sched_time(unsigned int len,
+  struct netem_sched_data *q)
 {
-   u64 ticks;
-
+   u64 offset;
len += q->packet_overhead;
 
if (q->cell_size) {
@@ -345,11 +345,9 @@ static psched_time_t packet_len_2_sched_time(unsigned int len, struct netem_sche
cells++;
len = cells * (q->cell_size + q->cell_overhead);
}
-
-   ticks = (u64)len * NSEC_PER_SEC;
-
-   do_div(ticks, q->rate);
-   return PSCHED_NS2TICKS(ticks);
+   offset = (u64)len * NSEC_PER_SEC;
+   do_div(offset, q->rate);
+   return offset;
 }
 
 static void tfifo_reset(struct Qdisc *sch)
@@ -369,7 +367,7 @@ static void tfifo_reset(struct Qdisc *sch)
 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
-   psched_time_t tnext = netem_skb_cb(nskb)->time_to_send;
+   u64 tnext = netem_skb_cb(nskb)->time_to_send;
struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
 
while (*p) {
@@ -515,13 +513,13 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
if (q->gap == 0 ||  /* not doing reordering */
q->counter < q->gap - 1 ||  /* inside last reordering gap */
q->reorder < get_crandom(&q->reorder_cor)) {
-   psched_time_t now;
-   psched_tdiff_t delay;
+   u64 now;
+   s64 delay;
 
delay = tabledist(q->latency, q->jitter,
  &q->delay_cor, q->delay_dist);
 
-   now = psched_get_time();
+   now = ktime_get_ns();
 
if (q->rate) {
struct netem_skb_cb *last = NULL;
@@ -547,7 +545,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 * from delay.
 */
delay -= last->time_to_send - now;
-   delay = max_t(psched_tdiff_t, 0, delay);
+   delay = max_t(s64, 0, delay);
now = last->time_to_send;
}
 
@@ -562,7 +560,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 * Do re-ordering by putting one out of N packets at the front
 * of the queue.
 */
-   cb->time_to_send = psched_get_time();
+   cb->time_to_send = ktime_get_ns();
q->counter = 0;
 
netem_enqueue_skb_head(&sch->q, skb);
@@ -609,13 +607,13 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
}
p = rb_first(&q->t_root);
if (p) {
-   psched_time_t time_to_send;
+   u64 time_to_send;
 
skb = rb_to_skb(p);
 

Re: [PATCH net-next 0/3] netem: add nsec scheduling and slot feature

2017-11-08 Thread Dave Taht
On Tue, Nov 7, 2017 at 4:26 PM, Stephen Hemminger
 wrote:
> On Tue,  7 Nov 2017 12:59:33 -0800
> Dave Taht  wrote:
>
>> This patch series converts netem away from the old "ticks" interface and
>> userspace API, and adds support for a new "slot" feature intended to
>> emulate bursty macs such as WiFi and LTE better.
>>
>> Dave Taht (3):
>>   netem: convert to qdisc_watchdog_schedule_ns
>>   netem: add uapi to express delay and jitter in nanosec
>>   netem: support delivering packets in delayed time slots
>>
>>  include/uapi/linux/pkt_sched.h |  10 +++
>>  net/sched/sch_netem.c  | 144 -
>>  2 files changed, 125 insertions(+), 29 deletions(-)
>>
>
> Dave, thanks for the patch.
> One issue is that it needs to keep binary compatibility both for kernel and 
> iproute.
> That means that users of new kernel should be able to use old versions of 
> iproute
> without any visible impact (and vice versa).
>
> For the kernel, that means if new attributes are not present the old 
> attributes
> would be used.

For the kernel patchset you are commenting on, this was the case. There was
no way to send via an old iproute2 a jitter or delay value that would
trigger sending the new attributes, although the old api has severe
rounding errors as you get down to a few us.

I just sent a kernel v2 that always sends the old and new attributes.

>For iproute2 that means send both new and old versions.

This, in the iproute2 patchset, didn't, as I assumed new versions of
iproute2 would only be used with newer kernels. I'll respin that.

I did want to somehow, one day, obsolete ticks, but I guess
that is not possible.

-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


[PATCH v2 net-next 3/3] netem: support delivering packets in delayed time slots

2017-11-08 Thread Dave Taht
Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.

It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.

The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.

The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.

Examples of use:

tc qdisc add dev eth0 root netem delay 200us \
 slot 800us 10ms bytes 64k packets 42

A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:

tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
 limit 1
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
 slot 800us 10ms bytes 64k packets 42 limit 512

Signed-off-by: Dave Taht 
---
 include/uapi/linux/pkt_sched.h |  8 +
 net/sched/sch_netem.c  | 76 --
 2 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 8fe6d18..af3cc2f 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -539,6 +539,7 @@ enum {
TCA_NETEM_PAD,
TCA_NETEM_LATENCY64,
TCA_NETEM_JITTER64,
+   TCA_NETEM_SLOT,
__TCA_NETEM_MAX,
 };
 
@@ -576,6 +577,13 @@ struct tc_netem_rate {
__s32   cell_overhead;
 };
 
+struct tc_netem_slot {
+   __s64   min_delay; /* nsec */
+   __s64   max_delay;
+   __s32   max_packets;
+   __s32   max_bytes;
+};
+
 enum {
NETEM_LOSS_UNSPEC,
NETEM_LOSS_GI,  /* General Intuitive - 4 state model */
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index ef63ae4..b697f89 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -135,6 +135,13 @@ struct netem_sched_data {
u32 a5; /* p23 used only in 4-states */
} clg;
 
+   struct tc_netem_slot slot_config;
+   struct slotstate {
+   u64 slot_next;
+   s32 packets_left;
+   s32 bytes_left;
+   } slot;
+
 };
 
 /* Time stamp put into socket buffer control block
@@ -591,6 +598,20 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
return NET_XMIT_SUCCESS;
 }
 
+/* Delay the next round with a new future slot with a
+ * correct number of bytes and packets.
+ */
+
+static void get_slot_next(struct netem_sched_data *q, u64 now)
+{
+   q->slot.slot_next = now + q->slot_config.min_delay +
+   (prandom_u32() *
+   (q->slot_config.max_delay -
+   q->slot_config.min_delay) >> 32);
+   q->slot.packets_left = q->slot_config.max_packets;
+   q->slot.bytes_left = q->slot_config.max_bytes;
+}
+
 static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
@@ -608,14 +629,17 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
p = rb_first(&q->t_root);
if (p) {
u64 time_to_send;
+   u64 now = ktime_get_ns();
 
skb = rb_to_skb(p);
 
/* if more time remaining? */
time_to_send = netem_skb_cb(skb)->time_to_send;
-   if (time_to_send <= ktime_get_ns()) {
-   rb_erase(p, &q->t_root);
+   if (q->slot.slot_next && q->slot.slot_next < time_to_send)
+   get_slot_next(q, now);
 
+   if (time_to_send <= now &&  q->slot.slot_next <= now) {
+   rb_erase(p, &q->t_root);
sch->q.qlen--;
qdisc_qstats_backlog_dec(sch, skb);
skb->next = NULL;
@@ -634,6 +658,14 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
skb->tstamp = 0;
 #endif
 
+   if (q->slot.slot_next) {
+   q->slot.packets_left--;
+   q->slot.bytes_left -= qdisc_pkt_len(skb);
+   if (q->slot.packets_left <= 0 ||
+   q->slot.bytes_left <= 0)
+   get_slot_next(q, now);
+   }
+
if (q->qdisc) {
unsigned int pkt_len = qdisc_pkt_len(skb);
struct sk_buff *to_free = NULL;

[PATCH v2 net-next 2/3] netem: add uapi to express delay and jitter in nanoseconds

2017-11-08 Thread Dave Taht
netem userspace has long relied on a horrible /proc/net/psched hack
to translate the current notion of "ticks" to nanoseconds.

Expressing latency and jitter instead, in well defined nanoseconds,
increases the dynamic range of emulated delays and jitter in netem.

It will also ease a transition where reducing a tick to nsec
equivalence would constrain the max delay in prior versions of
netem to only 4.3 seconds.

Signed-off-by: Dave Taht 
---
 include/uapi/linux/pkt_sched.h |  2 ++
 net/sched/sch_netem.c  | 14 ++
 2 files changed, 16 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 6a2c5ea..8fe6d18 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -537,6 +537,8 @@ enum {
TCA_NETEM_ECN,
TCA_NETEM_RATE64,
TCA_NETEM_PAD,
+   TCA_NETEM_LATENCY64,
+   TCA_NETEM_JITTER64,
__TCA_NETEM_MAX,
 };
 
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 5559ad1..ef63ae4 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -819,6 +819,8 @@ static const struct nla_policy netem_policy[TCA_NETEM_MAX + 1] = {
[TCA_NETEM_LOSS]= { .type = NLA_NESTED },
[TCA_NETEM_ECN] = { .type = NLA_U32 },
[TCA_NETEM_RATE64]  = { .type = NLA_U64 },
+   [TCA_NETEM_LATENCY64]   = { .type = NLA_S64 },
+   [TCA_NETEM_JITTER64]= { .type = NLA_S64 },
 };
 
 static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
@@ -916,6 +918,12 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt)
q->rate = max_t(u64, q->rate,
nla_get_u64(tb[TCA_NETEM_RATE64]));
 
+   if (tb[TCA_NETEM_LATENCY64])
+   q->latency = nla_get_s64(tb[TCA_NETEM_LATENCY64]);
+
+   if (tb[TCA_NETEM_JITTER64])
+   q->jitter = nla_get_s64(tb[TCA_NETEM_JITTER64]);
+
if (tb[TCA_NETEM_ECN])
q->ecn = nla_get_u32(tb[TCA_NETEM_ECN]);
 
@@ -1020,6 +1028,12 @@ static int netem_dump(struct Qdisc *sch, struct sk_buff *skb)
if (nla_put(skb, TCA_OPTIONS, sizeof(qopt), &qopt))
goto nla_put_failure;
 
+   if (nla_put(skb, TCA_NETEM_LATENCY64, sizeof(q->latency), &q->latency))
+   goto nla_put_failure;
+
+   if (nla_put(skb, TCA_NETEM_JITTER64, sizeof(q->jitter), &q->jitter))
+   goto nla_put_failure;
+
cor.delay_corr = q->delay_cor.rho;
cor.loss_corr = q->loss_cor.rho;
cor.dup_corr = q->dup_cor.rho;
-- 
2.7.4



[PATCH v2 net-next 0/3] netem: add nsec scheduling and slot feature

2017-11-08 Thread Dave Taht
This patch series converts netem away from the old "ticks" interface and
userspace API, and adds support for a new "slot" feature intended to
emulate bursty MACs such as WiFi and LTE better.

Changes since v1:
Always pass new nanosecond APIs to userspace

Dave Taht (3):
  netem: convert to qdisc_watchdog_schedule_ns
  netem: add uapi to express delay and jitter in nanoseconds
  netem: support delivering packets in delayed time slots

 include/uapi/linux/pkt_sched.h |  10 +++
 net/sched/sch_netem.c  | 142 -
 2 files changed, 123 insertions(+), 29 deletions(-)

-- 
2.7.4



[PATCH v2 net-next 1/3] netem: convert to qdisc_watchdog_schedule_ns

2017-11-08 Thread Dave Taht
Upgrade the internal netem scheduler to use nanoseconds rather than
ticks throughout.

Convert to and from the std "ticks" userspace api automatically,
while allowing for finer grained scheduling to take place.

Signed-off-by: Dave Taht 
---
 net/sched/sch_netem.c | 56 +--
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index db0228a..5559ad1 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,8 +77,8 @@ struct netem_sched_data {
 
struct qdisc_watchdog watchdog;
 
-   psched_tdiff_t latency;
-   psched_tdiff_t jitter;
+   s64 latency;
+   s64 jitter;
 
u32 loss;
u32 ecn;
@@ -145,7 +145,7 @@ struct netem_sched_data {
  * we save skb->tstamp value in skb->cb[] before destroying it.
  */
 struct netem_skb_cb {
-   psched_time_t   time_to_send;
+   u64 time_to_send;
 };
 
 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
@@ -305,11 +305,11 @@ static bool loss_event(struct netem_sched_data *q)
  * std deviation sigma.  Uses table lookup to approximate the desired
  * distribution, and a uniformly-distributed pseudo-random source.
  */
-static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
-   struct crndstate *state,
-   const struct disttable *dist)
+static s64 tabledist(s64 mu, s64 sigma,
+struct crndstate *state,
+const struct disttable *dist)
 {
-   psched_tdiff_t x;
+   s64 x;
long t;
u32 rnd;
 
@@ -332,10 +332,10 @@ static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
return  x / NETEM_DIST_SCALE + (sigma / NETEM_DIST_SCALE) * t + mu;
 }
 
-static psched_time_t packet_len_2_sched_time(unsigned int len, struct netem_sched_data *q)
+static s64 packet_len_2_sched_time(unsigned int len,
+  struct netem_sched_data *q)
 {
-   u64 ticks;
-
+   s64 offset;
len += q->packet_overhead;
 
if (q->cell_size) {
@@ -345,11 +345,9 @@ static psched_time_t packet_len_2_sched_time(unsigned int len, struct netem_sche
cells++;
len = cells * (q->cell_size + q->cell_overhead);
}
-
-   ticks = (u64)len * NSEC_PER_SEC;
-
-   do_div(ticks, q->rate);
-   return PSCHED_NS2TICKS(ticks);
+   offset = (s64)len * NSEC_PER_SEC;
+   do_div(offset, q->rate);
+   return offset;
 }
 
 static void tfifo_reset(struct Qdisc *sch)
@@ -369,7 +367,7 @@ static void tfifo_reset(struct Qdisc *sch)
 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
-   psched_time_t tnext = netem_skb_cb(nskb)->time_to_send;
+   u64 tnext = netem_skb_cb(nskb)->time_to_send;
struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
 
while (*p) {
@@ -515,13 +513,13 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
if (q->gap == 0 ||  /* not doing reordering */
q->counter < q->gap - 1 ||  /* inside last reordering gap */
q->reorder < get_crandom(&q->reorder_cor)) {
-   psched_time_t now;
-   psched_tdiff_t delay;
+   u64 now;
+   s64 delay;
 
delay = tabledist(q->latency, q->jitter,
  &q->delay_cor, q->delay_dist);
 
-   now = psched_get_time();
+   now = ktime_get_ns();
 
if (q->rate) {
struct netem_skb_cb *last = NULL;
@@ -547,7 +545,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 * from delay.
 */
delay -= last->time_to_send - now;
-   delay = max_t(psched_tdiff_t, 0, delay);
+   delay = max_t(s64, 0, delay);
now = last->time_to_send;
}
 
@@ -562,7 +560,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 * Do re-ordering by putting one out of N packets at the front
 * of the queue.
 */
-   cb->time_to_send = psched_get_time();
+   cb->time_to_send = ktime_get_ns();
q->counter = 0;
 
netem_enqueue_skb_head(&sch->q, skb);
@@ -609,13 +607,13 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
}
p = rb_first(&q->t_root);
if (p) {
-   psched_time_t time_to_send;
+   u64 time_to_send;
 
skb = rb_to_skb(p);
 

[PATCH iproute2 2/4] q_netem: utilize 64 bit nanosecond API for delay and jitter

2017-11-07 Thread Dave Taht
This starts to obsolete the old "ticks" api in favor of well
defined nanoseconds for the kernel/userspace netem interface.

Signed-off-by: Dave Taht 
---
 tc/q_netem.c | 68 +++-
 1 file changed, 44 insertions(+), 24 deletions(-)

diff --git a/tc/q_netem.c b/tc/q_netem.c
index cdaddce..22a5b94 100644
--- a/tc/q_netem.c
+++ b/tc/q_netem.c
@@ -151,24 +151,6 @@ static int get_distribution(const char *type, __s16 *data, int maxdata)
 #define NEXT_IS_SIGNED_NUMBER() \
(NEXT_ARG_OK() && (isdigit(argv[1][0]) || argv[1][0] == '-'))
 
-/* Adjust for the fact that psched_ticks aren't always usecs
-   (based on kernel PSCHED_CLOCK configuration */
-static int get_ticks(__u32 *ticks, const char *str)
-{
-   unsigned int t;
-
-   if (get_time(&t, str))
-   return -1;
-
-   if (tc_core_time2big(t)) {
-   fprintf(stderr, "Illegal %u time (too large)\n", t);
-   return -1;
-   }
-
-   *ticks = tc_core_time2tick(t);
-   return 0;
-}
-
 static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
   struct nlmsghdr *n)
 {
@@ -185,6 +167,8 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
__u16 loss_type = NETEM_LOSS_UNSPEC;
int present[__TCA_NETEM_MAX] = {};
__u64 rate64 = 0;
+   __s64 latency64 = 0;
+   __s64 jitter64 = 0;
 
for ( ; argc > 0; --argc, ++argv) {
if (matches(*argv, "limit") == 0) {
@@ -196,14 +180,16 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
} else if (matches(*argv, "latency") == 0 ||
   matches(*argv, "delay") == 0) {
NEXT_ARG();
-   if (get_ticks(&opt.latency, *argv)) {
+   present[TCA_NETEM_LATENCY64] = 1;
+   if (get_time64(&latency64, *argv)) {
explain1("latency");
return -1;
}
 
if (NEXT_IS_NUMBER()) {
NEXT_ARG();
-   if (get_ticks(&opt.jitter, *argv)) {
+   present[TCA_NETEM_JITTER64] = 1;
+   if (get_time64(&jitter64, *argv)) {
explain1("latency");
return -1;
}
@@ -437,7 +423,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
tail = NLMSG_TAIL(n);
 
if (reorder.probability) {
-   if (opt.latency == 0) {
+   if (latency64 == 0) {
fprintf(stderr, "reordering not possible without 
specifying some delay\n");
explain();
return -1;
@@ -458,7 +444,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
}
}
 
-   if (dist_data && (opt.latency == 0 || opt.jitter == 0)) {
+   if (dist_data && (latency64 == 0 || jitter64 == 0)) {
fprintf(stderr, "distribution specified but no latency and 
jitter values\n");
explain();
return -1;
@@ -467,6 +453,16 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
if (addattr_l(n, 1024, TCA_OPTIONS, &opt, sizeof(opt)) < 0)
return -1;
 
+   if (present[TCA_NETEM_LATENCY64] &&
+   addattr_l(n, 1024, TCA_NETEM_LATENCY64, &latency64,
+ sizeof(latency64)) < 0)
+   return -1;
+
+   if (present[TCA_NETEM_JITTER64] &&
+   addattr_l(n, 1024, TCA_NETEM_JITTER64, &jitter64,
+ sizeof(jitter64)) < 0)
+   return -1;
+
if (present[TCA_NETEM_CORR] &&
addattr_l(n, 1024, TCA_NETEM_CORR, &cor, sizeof(cor)) < 0)
return -1;
@@ -540,6 +536,8 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
const struct tc_netem_rate *rate = NULL;
int len;
__u64 rate64 = 0;
+   __s64 latency64 = 0;
+   __s64 jitter64 = 0;
 
SPRINT_BUF(b1);
 
@@ -559,6 +557,18 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
parse_rtattr(tb, TCA_NETEM_MAX, RTA_DATA(opt) + sizeof(qopt),
 len);
 
+   if (tb[TCA_NETEM_LATENCY64]) {
+   if (RTA_PAYLOAD(tb[TCA_NETEM_LATENCY64]) <
+   sizeof(latency64))
+   return -1;
+   latency64 = rta_geta

[PATCH iproute2 4/4] netem: add documentation for the new slotting feature

2017-11-07 Thread Dave Taht
Signed-off-by: Dave Taht 
---
 man/man8/tc-netem.8 | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/man/man8/tc-netem.8 b/man/man8/tc-netem.8
index b31384f..8c41d77 100644
--- a/man/man8/tc-netem.8
+++ b/man/man8/tc-netem.8
@@ -8,7 +8,8 @@ NetEm \- Network Emulator
 .I OPTIONS
 
 .IR OPTIONS " := [ " LIMIT " ] [ " DELAY " ] [ " LOSS \
-" ] [ " CORRUPT " ] [ " DUPLICATION " ] [ " REORDERING " ][ " RATE " ]"
+" ] [ " CORRUPT " ] [ " DUPLICATION " ] [ " REORDERING " ] [ " RATE \
+" ] [ " SLOT " ]"
 
 .IR LIMIT " := "
 .B limit
@@ -51,6 +52,14 @@ NetEm \- Network Emulator
 .B rate
 .IR RATE " [ " PACKETOVERHEAD " [ " CELLSIZE " [ " CELLOVERHEAD " "
 
+.IR SLOT " := "
+.BR slot
+.IR MIN_DELAY " [ " MAX_DELAY " ] {["
+.BR packets
+.IR PACKETS " ] [ "
+.BR bytes
+.IR BYTES " ] }]"
+
 
 .SH DESCRIPTION
 NetEm is an enhancement of the Linux traffic control facilities
@@ -162,6 +171,27 @@ granularity avoid a perfect shaping at a specific level. This will show up in
 an artificial packet compression (bursts). Another influence factor are network
 adapter buffers which can also add artificial delay.
 
+.SS slot
+defer delivering accumulated packets to within a slot, with each available slot
+configured with a minimum delay to acquire, and an optional maximum delay.  Slot
+delays can be specified in nanoseconds, microseconds, milliseconds or seconds
+(e.g. 800us). Values for the optional parameters
+.I BYTES
+will limit the number of bytes delivered per slot, and/or
+.I PACKETS
+will limit the number of packets delivered per slot.
+
+These slot options can provide a crude approximation of bursty MACs such as
+DOCSIS, WiFi, and LTE.
+
+Note that slotting is limited by several factors: the kernel clock granularity,
+as with a rate, and attempts to deliver many packets within a slot will be
+smeared by the timer resolution, and by the underlying native bandwidth also.
+
+It is possible to combine slotting with a rate, in which case complex behaviors
+where either the rate, or the slot limits on bytes or packets per slot, govern
+the actual delivered rate.
+
 .SH LIMITATIONS
 The main known limitation of Netem are related to timer granularity, since
 Linux is not a real-time operating system.
-- 
2.7.4



[PATCH iproute2 1/4] tc: support conversions to or from 64 bit nanosecond-based time

2017-11-07 Thread Dave Taht
Using a 32 bit field to represent time in nanoseconds results in a
maximum value of about 4.3 seconds, which is well below many observed
delays in WiFi and LTE, and barely in the ballpark for a trip past the
Earth's moon, Luna.

Using 64 bit time fields in nanoseconds allows us to simulate
network diameters of several hundred light-years. However, only
conversions to and from ns, us, ms, and seconds are provided.
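
For the record, the arithmetic behind those two claims (plain math,
not iproute2 code): 2^32 nanoseconds is about 4.29 seconds, while a
signed 64-bit nanosecond field reaches roughly 292 years:

#include <stdio.h>

int main(void)
{
	double ns32 = 4294967296.0;		/* 2^32 ns */
	double ns63 = 9223372036854775808.0;	/* 2^63 ns */

	printf("u32 max: %.3f s\n", ns32 / 1e9);
	printf("s64 max: %.1f years\n", ns63 / 1e9 / 86400 / 365.25);
	return 0;
}

which prints 4.295 s and ~292.3 years.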

Signed-off-by: Dave Taht 
---
 tc/tc_core.h |  3 +++
 tc/tc_util.c | 56 
 tc/tc_util.h |  3 +++
 3 files changed, 62 insertions(+)

diff --git a/tc/tc_core.h b/tc/tc_core.h
index 8a63b79..e8bda8b 100644
--- a/tc/tc_core.h
+++ b/tc/tc_core.h
@@ -5,6 +5,9 @@
 #include 
 
 #define TIME_UNITS_PER_SEC 100
+#define USEC_PER_SEC 1000
+#define MSEC_PER_SEC (1000 * 1000)
+#define NSEC_PER_SEC (1000 * 1000 * 1000)
 
 enum link_layer {
LINKLAYER_UNSPEC,
diff --git a/tc/tc_util.c b/tc/tc_util.c
index b39e550..dafbbc3 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -323,6 +323,62 @@ char *sprint_ticks(__u32 ticks, char *buf)
return sprint_time(tc_core_tick2time(ticks), buf);
 }
 
+/* 64 bit times are represented internally in nanoseconds */
+
+int get_time64(__s64 *time, const char *str)
+{
+   double t;
+   char *p;
+
+   t = strtod(str, &p);
+   if (p == str)
+   return -1;
+
+   if (*p) {
+   if (strcasecmp(p, "s") == 0 ||
+   strcasecmp(p, "sec") == 0 ||
+   strcasecmp(p, "secs") == 0)
+   t *= NSEC_PER_SEC;
+   else if (strcasecmp(p, "ms") == 0 ||
+strcasecmp(p, "msec") == 0 ||
+strcasecmp(p, "msecs") == 0)
+   t *= MSEC_PER_SEC;
+   else if (strcasecmp(p, "us") == 0 ||
+strcasecmp(p, "usec") == 0 ||
+strcasecmp(p, "usecs") == 0)
+   t *= USEC_PER_SEC;
+   else if (strcasecmp(p, "ns") == 0 ||
+strcasecmp(p, "nsec") == 0 ||
+strcasecmp(p, "nsecs") == 0)
+   t *= 1;
+   else
+   return -1;
+   }
+
+   *time = t;
+   return 0;
+}
+
+void print_time64(char *buf, int len, __s64 time)
+{
+   double tmp = time;
+
+   if (time >= NSEC_PER_SEC)
+   snprintf(buf, len, "%.3fs", tmp/NSEC_PER_SEC);
+   else if (time >= MSEC_PER_SEC)
+   snprintf(buf, len, "%.3fms", tmp/MSEC_PER_SEC);
+   else if (time >= USEC_PER_SEC)
+   snprintf(buf, len, "%.3fus", tmp/USEC_PER_SEC);
+   else
+   snprintf(buf, len, "%lldns", time);
+}
+
+char *sprint_time64(__s64 time, char *buf)
+{
+   print_time64(buf, SPRINT_BSIZE-1, time);
+   return buf;
+}
+
 int get_size(unsigned int *size, const char *str)
 {
double sz;
diff --git a/tc/tc_util.h b/tc/tc_util.h
index 583a21a..7a8559b 100644
--- a/tc/tc_util.h
+++ b/tc/tc_util.h
@@ -72,12 +72,14 @@ int get_rate64(__u64 *rate, const char *str);
 int get_size(unsigned int *size, const char *str);
 int get_size_and_cell(unsigned int *size, int *cell_log, char *str);
 int get_time(unsigned int *time, const char *str);
+int get_time64(__s64 *time, const char *str);
 int get_linklayer(unsigned int *val, const char *arg);
 
 void print_rate(char *buf, int len, __u64 rate);
 void print_size(char *buf, int len, __u32 size);
 void print_qdisc_handle(char *buf, int len, __u32 h);
 void print_time(char *buf, int len, __u32 time);
+void print_time64(char *buf, int len, __s64 time);
 void print_linklayer(char *buf, int len, unsigned int linklayer);
 
 char *sprint_rate(__u64 rate, char *buf);
@@ -85,6 +87,7 @@ char *sprint_size(__u32 size, char *buf);
 char *sprint_qdisc_handle(__u32 h, char *buf);
 char *sprint_tc_classid(__u32 h, char *buf);
 char *sprint_time(__u32 time, char *buf);
+char *sprint_time64(__s64 time, char *buf);
 char *sprint_ticks(__u32 ticks, char *buf);
 char *sprint_linklayer(unsigned int linklayer, char *buf);
 
-- 
2.7.4



[PATCH iproute2 3/4] q_netem: support delivering packets in delayed time slots

2017-11-07 Thread Dave Taht
Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.

It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.

The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.

The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.

Examples of use:

tc qdisc add dev eth0 root netem delay 200us \
slot 800us 10ms bytes 64k packets 42

A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:

tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
 limit 1
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
 slot 800us 10ms bytes 64k packets 42 limit 512

Signed-off-by: Dave Taht 
---
 tc/q_netem.c | 55 ++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/tc/q_netem.c b/tc/q_netem.c
index 22a5b94..c606cfd 100644
--- a/tc/q_netem.c
+++ b/tc/q_netem.c
@@ -40,7 +40,10 @@ static void explain(void)
 " [ loss gemodel PERCENT [R [1-H [1-K]]]\n" \
 " [ ecn ]\n" \
 " [ reorder PRECENT [CORRELATION] [ gap DISTANCE ]]\n" \
-" [ rate RATE [PACKETOVERHEAD] [CELLSIZE] [CELLOVERHEAD]]\n");
+" [ rate RATE [PACKETOVERHEAD] [CELLSIZE] [CELLOVERHEAD]]\n" \
+" [ slot MIN_DELAY MAX_DELAY [packets MAX_PACKETS]" \
+" [bytes MAX_BYTES]]\n" \
+   );
 }
 
 static void explain1(const char *arg)
@@ -163,6 +166,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
struct tc_netem_gimodel gimodel;
struct tc_netem_gemodel gemodel;
struct tc_netem_rate rate = {};
+   struct tc_netem_slot slot = {};
__s16 *dist_data = NULL;
__u16 loss_type = NETEM_LOSS_UNSPEC;
int present[__TCA_NETEM_MAX] = {};
@@ -410,6 +414,36 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
return -1;
}
}
+   } else if (matches(*argv, "slot") == 0) {
+   NEXT_ARG();
+   present[TCA_NETEM_SLOT] = 1;
+   if (get_time64(&slot.min_delay, *argv)) {
+   explain1("slot min_delay");
+   return -1;
+   }
+   if (NEXT_IS_NUMBER()) {
+   NEXT_ARG();
+   if (get_time64(&slot.max_delay, *argv)) {
+   explain1("slot min_delay max_delay");
+   return -1;
+   }
+   }
+   if (slot.max_delay < slot.min_delay)
+   slot.max_delay = slot.min_delay;
+   } else if (matches(*argv, "packets") == 0) {
+   NEXT_ARG();
+   if (get_s32(&slot.max_packets, *argv, 0)) {
+   explain1("slot packets");
+   return -1;
+   }
+   } else if (matches(*argv, "bytes") == 0) {
+   unsigned int max_bytes;
+   NEXT_ARG();
+   if (get_size(&max_bytes, *argv)) {
+   explain1("slot bytes");
+   return -1;
+   }
+   slot.max_bytes = (int) max_bytes;
} else if (strcmp(*argv, "help") == 0) {
explain();
return -1;
@@ -480,6 +514,10 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
addattr_l(n, 1024, TCA_NETEM_CORRUPT, &corrupt, sizeof(corrupt)) < 0)
return -1;
 
+   if (present[TCA_NETEM_SLOT] &&
+   addattr_l(n, 1024, TCA_NETEM_SLOT, &slot, sizeof(slot)) < 0)
+   return -1;
+
if (loss_type != NETEM_LOSS_UNSPEC) {
struct rtattr *start;
 
@@ -534,6 +572,7 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
int *ecn = NULL;
struct tc_netem_qopt qopt;
const struct tc_netem_rate *rate = NULL;

[PATCH iproute2 0/4] support nsec uapi and add netem slotting feature

2017-11-07 Thread Dave Taht

This patch series adds support for specifying time in nanoseconds
to tc, updates netem to use it (this requires importing pkt_sched.h
from the kernel patch series), and lastly adds a new slotting
feature intended to better emulate how LTE and WiFi behave.

Dave Taht (4):
  tc: support conversions to or from 64 bit nanosecond-based time
  q_netem: utilize 64 bit nanosecond API for delay and jitter
  q_netem: support delivering packets in delayed time slots
  netem: add documentation for the new slotting feature

 man/man8/tc-netem.8 |  32 +-
 tc/q_netem.c| 123 +---
 tc/tc_core.h|   3 ++
 tc/tc_util.c|  56 
 tc/tc_util.h|   3 ++
 5 files changed, 191 insertions(+), 26 deletions(-)

-- 
2.7.4



[PATCH net-next 1/3] netem: convert to qdisc_watchdog_schedule_ns

2017-11-07 Thread Dave Taht
Upgrade the internal netem scheduler to use nanoseconds rather than
ticks throughout.

Convert to and from the standard "ticks" userspace API automatically,
while allowing finer-grained scheduling to take place.

Signed-off-by: Dave Taht 
---
 net/sched/sch_netem.c | 56 +--
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index db0228a..443a75d 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,8 +77,8 @@ struct netem_sched_data {
 
struct qdisc_watchdog watchdog;
 
-   psched_tdiff_t latency;
-   psched_tdiff_t jitter;
+   s64 latency;
+   s64 jitter;
 
u32 loss;
u32 ecn;
@@ -145,7 +145,7 @@ struct netem_sched_data {
  * we save skb->tstamp value in skb->cb[] before destroying it.
  */
 struct netem_skb_cb {
-   psched_time_t   time_to_send;
+   u64 time_to_send;
 };
 
 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
@@ -305,11 +305,11 @@ static bool loss_event(struct netem_sched_data *q)
  * std deviation sigma.  Uses table lookup to approximate the desired
  * distribution, and a uniformly-distributed pseudo-random source.
  */
-static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
-   struct crndstate *state,
-   const struct disttable *dist)
+static s64 tabledist(s64 mu, s64 sigma,
+struct crndstate *state,
+const struct disttable *dist)
 {
-   psched_tdiff_t x;
+   s64 x;
long t;
u32 rnd;
 
@@ -332,10 +332,10 @@ static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
return  x / NETEM_DIST_SCALE + (sigma / NETEM_DIST_SCALE) * t + mu;
 }
 
-static psched_time_t packet_len_2_sched_time(unsigned int len, struct netem_sched_data *q)
+static s64 packet_len_2_sched_time(unsigned int len,
+  struct netem_sched_data *q)
 {
-   u64 ticks;
-
+   s64 offset;
len += q->packet_overhead;
 
if (q->cell_size) {
@@ -345,11 +345,9 @@ static psched_time_t packet_len_2_sched_time(unsigned int len, struct netem_sche
cells++;
len = cells * (q->cell_size + q->cell_overhead);
}
-
-   ticks = (u64)len * NSEC_PER_SEC;
-
-   do_div(ticks, q->rate);
-   return PSCHED_NS2TICKS(ticks);
+   offset = (s64)len * NSEC_PER_SEC;
+   do_div(offset, q->rate);
+   return offset;
 }
 
 static void tfifo_reset(struct Qdisc *sch)
@@ -369,7 +367,7 @@ static void tfifo_reset(struct Qdisc *sch)
 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
-   psched_time_t tnext = netem_skb_cb(nskb)->time_to_send;
+   u64 tnext = netem_skb_cb(nskb)->time_to_send;
struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
 
while (*p) {
@@ -515,13 +513,13 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
if (q->gap == 0 ||  /* not doing reordering */
q->counter < q->gap - 1 ||  /* inside last reordering gap */
q->reorder < get_crandom(&q->reorder_cor)) {
-   psched_time_t now;
-   psched_tdiff_t delay;
+   u64 now;
+   s64 delay;
 
delay = tabledist(q->latency, q->jitter,
  &q->delay_cor, q->delay_dist);
 
-   now = psched_get_time();
+   now = ktime_get_ns();
 
if (q->rate) {
struct netem_skb_cb *last = NULL;
@@ -547,7 +545,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 * from delay.
 */
delay -= last->time_to_send - now;
-   delay = max_t(psched_tdiff_t, 0, delay);
+   delay = max_t(s64, 0, delay);
now = last->time_to_send;
}
 
@@ -562,7 +560,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 * Do re-ordering by putting one out of N packets at the front
 * of the queue.
 */
-   cb->time_to_send = psched_get_time();
+   cb->time_to_send = ktime_get_ns();
q->counter = 0;
 
netem_enqueue_skb_head(&sch->q, skb);
@@ -609,13 +607,13 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
}
p = rb_first(&q->t_root);
if (p) {
-   psched_time_t time_to_send;
+   u64 time_to_send;
 
skb = rb_t
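
(For reference, the tick<->nanosecond conversion at the userspace
boundary relies on the existing pkt_sched helpers, which in kernels of
this era treat one tick as 64 ns:)

    /* include/net/pkt_sched.h */
    #define PSCHED_SHIFT            6
    #define PSCHED_TICKS2NS(x)      ((s64)(x) << PSCHED_SHIFT)
    #define PSCHED_NS2TICKS(x)      ((x) >> PSCHED_SHIFT)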

[PATCH net-next 0/3] netem: add nsec scheduling and slot feature

2017-11-07 Thread Dave Taht
This patch series converts netem away from the old "ticks" interface and
userspace API, and adds support for a new "slot" feature intended to
better emulate bursty MACs such as WiFi and LTE.

Dave Taht (3):
  netem: convert to qdisc_watchdog_schedule_ns
  netem: add uapi to express delay and jitter in nanosec
  netem: support delivering packets in delayed time slots

 include/uapi/linux/pkt_sched.h |  10 +++
 net/sched/sch_netem.c  | 144 -
 2 files changed, 125 insertions(+), 29 deletions(-)

-- 
2.7.4



[PATCH net-next 3/3] netem: support delivering packets in delayed time slots

2017-11-07 Thread Dave Taht
Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.

It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.

The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.

The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.

Examples of use:

tc qdisc add dev eth0 root netem delay 200us \
 slot 800us 10ms bytes 64k packets 42

A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:

tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
 limit 1
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
 slot 800us 10ms bytes 64k packets 42 limit 512

Signed-off-by: Dave Taht 
---
 include/uapi/linux/pkt_sched.h |  8 +
 net/sched/sch_netem.c  | 76 --
 2 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 20cfd64..37b5096 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -538,6 +538,7 @@ enum {
TCA_NETEM_PAD,
TCA_NETEM_LATENCY64,
TCA_NETEM_JITTER64,
+   TCA_NETEM_SLOT,
__TCA_NETEM_MAX,
 };
 
@@ -575,6 +576,13 @@ struct tc_netem_rate {
__s32   cell_overhead;
 };
 
+struct tc_netem_slot {
+   __s64   min_delay; /* nsec */
+   __s64   max_delay;
+   __s32   max_packets;
+   __s32   max_bytes;
+};
+
 enum {
NETEM_LOSS_UNSPEC,
NETEM_LOSS_GI,  /* General Intuitive - 4 state model */
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 16c4813..a7189f9 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -135,6 +135,13 @@ struct netem_sched_data {
u32 a5; /* p23 used only in 4-states */
} clg;
 
+   struct tc_netem_slot slot_config;
+   struct slotstate {
+   u64 slot_next;
+   s32 packets_left;
+   s32 bytes_left;
+   } slot;
+
 };
 
 /* Time stamp put into socket buffer control block
@@ -591,6 +598,20 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
return NET_XMIT_SUCCESS;
 }
 
+/* Delay the next round with a new future slot with a
+ * correct number of bytes and packets.
+ */
+
+static void get_slot_next(struct netem_sched_data *q, u64 now)
+{
+   q->slot.slot_next = now + q->slot_config.min_delay +
+   (prandom_u32() *
+   (q->slot_config.max_delay -
+   q->slot_config.min_delay) >> 32);
+   q->slot.packets_left = q->slot_config.max_packets;
+   q->slot.bytes_left = q->slot_config.max_bytes;
+}
+
 static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
@@ -608,14 +629,17 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
p = rb_first(&q->t_root);
if (p) {
u64 time_to_send;
+   u64 now = ktime_get_ns();
 
skb = rb_to_skb(p);
 
/* if more time remaining? */
time_to_send = netem_skb_cb(skb)->time_to_send;
-   if (time_to_send <= ktime_get_ns()) {
-   rb_erase(p, &q->t_root);
+   if (q->slot.slot_next && q->slot.slot_next < time_to_send)
+   get_slot_next(q, now);
 
+   if (time_to_send <= now &&  q->slot.slot_next <= now) {
+   rb_erase(p, &q->t_root);
sch->q.qlen--;
qdisc_qstats_backlog_dec(sch, skb);
skb->next = NULL;
@@ -634,6 +658,14 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
skb->tstamp = 0;
 #endif
 
+   if (q->slot.slot_next) {
+   q->slot.packets_left--;
+   q->slot.bytes_left -= qdisc_pkt_len(skb);
+   if (q->slot.packets_left <= 0 ||
+   q->slot.bytes_left <= 0)
+   get_slot_next(q, now);
+   }
+
if (q->qdisc) {
unsigned int pkt_len = qdisc_pkt_len(skb);
struct sk_buff *to_free = NULL;
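
(The random slot selection in get_slot_next() above uses the
"(prandom_u32() * range) >> 32" idiom: a 64-bit multiply and shift that
maps a 32-bit random value uniformly onto [0, range) without a divide.
A standalone sketch of the same trick, with hypothetical names:)

    #include <stdint.h>

    /* Uniformly scale a 32-bit random value onto [0, range).
     * Assumes range < 2^32, i.e. here a min-to-max slot spread
     * of less than ~4.3 seconds worth of nanoseconds. */
    static uint64_t scale_rand32(uint32_t rnd, uint64_t range)
    {
        return ((uint64_t)rnd * range) >> 32;
    }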

[PATCH net-next 2/3] netem: add uapi to express delay and jitter in nanosec

2017-11-07 Thread Dave Taht
netem userspace has long relied on a horrible /proc/net/psched hack to
translate the current notion of "ticks" to nanoseconds.

Expressing latency and jitter instead in well-defined nanoseconds
increases the dynamic range of emulated delays and jitter in netem.

It will also ease a transition: redefining a tick as a nanosecond
would have constrained the maximum delay in prior versions of netem
to only 4.3 seconds.

Signed-off-by: Dave Taht 
---
 include/uapi/linux/pkt_sched.h |  2 ++
 net/sched/sch_netem.c  | 16 
 2 files changed, 18 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 5002562..20cfd64 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -536,6 +536,8 @@ enum {
TCA_NETEM_ECN,
TCA_NETEM_RATE64,
TCA_NETEM_PAD,
+   TCA_NETEM_LATENCY64,
+   TCA_NETEM_JITTER64,
__TCA_NETEM_MAX,
 };
 
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 443a75d..16c4813 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -819,6 +819,8 @@ static const struct nla_policy netem_policy[TCA_NETEM_MAX + 1] = {
[TCA_NETEM_LOSS]= { .type = NLA_NESTED },
[TCA_NETEM_ECN] = { .type = NLA_U32 },
[TCA_NETEM_RATE64]  = { .type = NLA_U64 },
+   [TCA_NETEM_LATENCY64]   = { .type = NLA_S64 },
+   [TCA_NETEM_JITTER64]= { .type = NLA_S64 },
 };
 
 static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
@@ -916,6 +918,12 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt)
q->rate = max_t(u64, q->rate,
nla_get_u64(tb[TCA_NETEM_RATE64]));
 
+   if (tb[TCA_NETEM_LATENCY64])
+   q->latency = nla_get_s64(tb[TCA_NETEM_LATENCY64]);
+
+   if (tb[TCA_NETEM_JITTER64])
+   q->jitter = nla_get_s64(tb[TCA_NETEM_JITTER64]);
+
if (tb[TCA_NETEM_ECN])
q->ecn = nla_get_u32(tb[TCA_NETEM_ECN]);
 
@@ -1020,6 +1028,14 @@ static int netem_dump(struct Qdisc *sch, struct sk_buff *skb)
if (nla_put(skb, TCA_OPTIONS, sizeof(qopt), &qopt))
goto nla_put_failure;
 
+   if (PSCHED_TICKS2NS(qopt.latency) != q->latency)
+   if (nla_put(skb, TCA_NETEM_LATENCY64, sizeof(q->latency),
+   &q->latency))
+   goto nla_put_failure;
+   if (PSCHED_TICKS2NS(qopt.jitter) != q->jitter)
+   if (nla_put(skb, TCA_NETEM_JITTER64, sizeof(q->jitter),
+   &q->jitter))
+   goto nla_put_failure;
cor.delay_corr = q->delay_cor.rho;
cor.loss_corr = q->loss_cor.rho;
cor.dup_corr = q->dup_cor.rho;
-- 
2.7.4
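
(For scale: a 32-bit field counting nanoseconds tops out at 2^32 ns =
4.294967296 s - the "only 4.3 seconds" ceiling the commit message above
refers to.)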



Re: Oops with HTB on net-next

2017-11-02 Thread Dave Taht
On Thu, Nov 2, 2017 at 11:09 AM, Cong Wang  wrote:
> On Wed, Nov 1, 2017 at 1:17 PM, Dave Taht  wrote:
>>
>> That is not in net-next, and the "net" version of that one patch does
>> not apply to net-next. The relevant thread says "... another fun merge
>> into net-next".
>>
>> Please let me know when the fun is done, and I'll retest.
>
> -net is merged into -net-next now.

retested with net-next as of commit: 2d2faaf0568b4946d9abeb4e541227b4ca259840

Run after boot, with the system fairly idle, sqm-scripts works in
setting up htb, fq_codel, filters, iptables rules, etc.

If I run sqm-scripts in early boot (run out of
/etc/network/if-up.d/sqm), with all the other startup activity going
on, it still fails.

sqm does lots of complicated stuff in rapid succession, and I'm not
sure how to go about reproducing this more simply than saying: grab
those, and hand them conf files for one existing and one non-existent
device.

I'll try to make it happen at later times, and try ripping out (for
example) the ifb setup and tc_mirred, etc, for the early boot
scenario. Can you suggest other means of debugging?

sqm-scripts repo: https://github.com/tohojo/sqm-scripts
my current sqm conf files: http://www.taht.net/~d/mysqmconfig.tgz
my current net-next kernel config:
http://www.taht.net/~d/mytcffailingkernel.config

Here's a complete dmesg of the most recent failure:

[0.00] Linux version 4.14.0-rc7-netem-4 (dave@nemesis) (gcc
version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)) #1 SMP PREEMPT
Thu Nov 2 14:19:17 PDT 2017
[0.00] Command line:
BOOT_IMAGE=/boot/vmlinuz-4.14.0-rc7-netem-4
root=UUID=ab3ceeb5-6d85-4c1c-8e0a-7e9e10949bae ro quiet splash
vt.handoff=7
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Centaur CentaurHauls
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating
point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is
832 bytes, using 'standard' format.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x00057fff] usable
[0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved
[0.00] BIOS-e820: [mem 0x00059000-0x0009efff] usable
[0.00] BIOS-e820: [mem 0x0009f000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x0010-0xa7edafff] usable
[0.00] BIOS-e820: [mem 0xa7edb000-0xa81bbfff] reserved
[0.00] BIOS-e820: [mem 0xa81bc000-0xabc08fff] usable
[0.00] BIOS-e820: [mem 0xabc09000-0xabc67fff] reserved
[0.00] BIOS-e820: [mem 0xabc68000-0xabfd4fff] usable
[0.00] BIOS-e820: [mem 0xabfd5000-0xacd4efff] ACPI NVS
[0.00] BIOS-e820: [mem 0xacd4f000-0xacfa9fff] reserved
[0.00] BIOS-e820: [mem 0xacfaa000-0xacffefff] type 20
[0.00] BIOS-e820: [mem 0xacfff000-0xacff] usable
[0.00] BIOS-e820: [mem 0xad80-0xafff] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00014eff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] efi: EFI v2.40 by American Megatrends
[0.00] efi:  ESRT=0xacfa6018  ACPI=0xac71b000  ACPI
2.0=0xac71b000  SMBIOS=0xf05b0  MPS=0xfd640
[0.00] random: fast init done
[0.00] SMBIOS 2.8 present.
[0.00] DMI: Intel Corporation SharkBay Platform/WhiteTip
Mountain1 Fab2, BIOS 5.6.5 06/18/2015
[0.00] tsc: Fast TSC calibration using PIT
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x14f000 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 

Re: Oops with HTB on net-next

2017-11-01 Thread Dave Taht
On Wed, Nov 1, 2017 at 9:04 AM, Cong Wang  wrote:
> On Tue, Oct 31, 2017 at 11:42 PM, Dave Taht  wrote:
>> I am using a fairly complex htb + three tiers of fq_codel and a couple
>> tc filters (it's the sqm-scripts "simple.qos" model). I rebased on
>> net-next head an hour ago, and
>>
>> [8.357963] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
>> [9.759942] u32 classifier
>> [9.759944] Actions configured
>> [9.762152] Mirror/redirect action on
>> [9.859138] BUG: unable to handle kernel NULL pointer dereference
>> at 0018
>> [9.859149] IP: tcf_block_put+0xf/0x50
>
> Do you have this fix?
>
> commit 822e86d997e4d8f942818ea6ac1711f59a66ebef
> Author: Cong Wang 
> Date:   Mon Oct 30 11:10:09 2017 -0700
>
> net_sched: remove tcf_block_put_deferred()

No.

That is not in net-next, and the "net" version of that one patch does
not apply to net-next. The relevant thread says "... another fun merge
into net-next".

Please let me know when the fun is done, and I'll retest.

-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


Oops with HTB on net-next

2017-10-31 Thread Dave Taht
I am using a fairly complex htb + three tiers of fq_codel and a couple
tc filters (it's the sqm-scripts "simple.qos" model). I rebased on
net-next head an hour ago, and

[8.357963] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
[9.759942] u32 classifier
[9.759944] Actions configured
[9.762152] Mirror/redirect action on
[9.859138] BUG: unable to handle kernel NULL pointer dereference
at 0018
[9.859149] IP: tcf_block_put+0xf/0x50
[9.859151] PGD 144e8e067 P4D 144e8e067 PUD 144ea3067 PMD 0
[9.859155] Oops:  [#1] PREEMPT SMP
[9.859157] Modules linked in: act_mirred cls_u32 sch_ingress
sch_fq_codel sch_htb ifb bnep binfmt_misc nls_iso8859_1 arc4
intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp
snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic
kvm_intel ath9k snd_hda_intel kvm ath9k_common ath9k_hw irqbypass
snd_hda_codec crct10dif_pclmul ath snd_hda_core crc32_pclmul
ghash_clmulni_intel snd_hwdep mac80211 uvcvideo pcbc videobuf2_vmalloc
videobuf2_memops snd_pcm videobuf2_v4l2 videobuf2_core aesni_intel
snd_seq_midi ath3k snd_seq_midi_event aes_x86_64 crypto_simd
glue_helper cfg80211 snd_rawmidi cryptd videodev btusb serio_raw btrtl
snd_seq btbcm btintel snd_seq_device media bluetooth input_leds
snd_timer lpc_ich ecdh_generic snd mei_me soundcore mei shpchp
topstar_laptop sparse_keymap acpi_als acpi_pad
[9.859194]  mac_hid kfifo_buf industrialio sch_fq cuse parport_pc
ppdev lp sunrpc parport autofs4 hid_generic usbhid hid i915
i2c_algo_bit drm_kms_helper syscopyarea r8169 sysfillrect sysimgblt
fb_sys_fops ahci drm libahci mii video
[9.859209] CPU: 3 PID: 1644 Comm: tc Not tainted 4.14.0-rc7-netem-3 #1
[9.859210] Hardware name: Intel Corporation SharkBay
Platform/WhiteTip Mountain1 Fab2, BIOS 5.6.5 06/18/2015
[9.859212] task: 8eb5c3ea1e80 task.stack: ac13c160c000
[9.859214] RIP: 0010:tcf_block_put+0xf/0x50
[9.859215] RSP: 0018:ac13c160f9b8 EFLAGS: 00010286
[9.859217] RAX: 8eb5c4febc01 RBX: 8eb5c6668000 RCX: 00f98e03
[9.859218] RDX: 00f98c03 RSI:  RDI: 
[9.859220] RBP: ac13c160f9c8 R08: 0001f820 R09: b37f0048
[9.859221] R10: cf324513fa80 R11: 0001 R12: 8eb5c6667000
[9.859222] R13: 0001 R14: 8eb5c4febe00 R15: 0001
[9.859223] FS:  7f322bc3e700() GS:8eb5ced8()
knlGS:
[9.859225] CS:  0010 DS:  ES:  CR0: 80050033
[9.859226] CR2: 0018 CR3: 000144f82002 CR4: 003606e0
[9.859227] Call Trace:
[9.859232]  htb_destroy_class.isra.18+0x3a/0x50 [sch_htb]
[9.859235]  htb_destroy+0xbe/0x160 [sch_htb]
[9.859237]  qdisc_destroy+0x5d/0xd0
[9.859240]  notify_and_destroy+0x2c/0x30
[9.859242]  qdisc_graft+0x2bd/0x370
[9.859244]  tc_get_qdisc+0x18f/0x230
[9.859250]  rtnetlink_rcv_msg+0x15c/0x270
[9.859252]  ? proto_seq_start+0x30/0x30
[9.859254]  ? __kmalloc_node_track_caller+0x35/0x2d0
[9.859256]  ? rtnl_calcit.isra.29+0x110/0x110
[9.859258]  netlink_rcv_skb+0xe7/0x120
[9.859260]  rtnetlink_rcv+0x15/0x20
[9.859262]  netlink_unicast+0x196/0x240
[9.859264]  netlink_sendmsg+0x2c5/0x3a0
[9.859267]  sock_sendmsg+0x38/0x50
[9.859269]  ___sys_sendmsg+0x2b6/0x2d0
[9.859272]  ? pagevec_lru_move_fn+0xc8/0xe0
[9.859274]  ? __activate_page+0x2c0/0x2c0
[9.859276]  ? __lru_cache_add+0x5f/0x80
[9.859278]  ? lru_cache_add_active_or_unevictable+0x36/0xb0
[9.859282]  __sys_sendmsg+0x54/0x90
[9.859284]  ? __sys_sendmsg+0x54/0x90
[9.859287]  SyS_sendmsg+0x12/0x20
[9.859290]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[9.859292] RIP: 0033:0x7f322b274450
[9.859293] RSP: 002b:7ffcb4925018 EFLAGS: 0246 ORIG_RAX:
002e
[9.859295] RAX: ffda RBX: 7ffcb4925000 RCX: 7f322b274450
[9.859296] RDX:  RSI: 7ffcb4925060 RDI: 0003
[9.859297] RBP: 59f95db7 R08: 0014 R09: 
[9.859298] R10: 05e7 R11: 0246 R12: 7ffcb4925060
[9.859299] R13: 7ffcb493d2b0 R14: 00655aa0 R15: 
[9.859301] Code: 48 8d 53 30 bf 00 02 00 00 e8 ee 71 8a ff 5b 5d
f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 31 f6 48 89 e5
48 83 ec 10 <48> 8b 57 18 48 8d 4d f4 c7 45 f4 00 00 00 00 65 48 8b 04
25 28
[9.859329] RIP: tcf_block_put+0xf/0x50 RSP: ac13c160f9b8
[9.859330] CR2: 0018
[9.859332] ---[ end trace 92295ce23d2d8592 ]---

Things have gone wrong somewhere between

commit aa2bc739ef4a181a7589eb009be96a870cc1788f
Author: Jakub Kicinski 
Date:   Mon Oct 30 13:46:47 2017 -0700

and:

commit ca82214144d925155977abe7c0af7c2c252b837f
Merge: f5333f80 75c119a
Author: David S. Miller 
Date:   Sat Oct 7 00:28:54 2017 +0100

Merge branch 'tcp

Re: [PATCH net-next] bridge: multicast to unicast

2017-01-10 Thread Dave Taht
On Tue, Jan 10, 2017 at 9:23 AM, Felix Fietkau  wrote:
> On 2017-01-10 18:17, Dave Taht wrote:
>> In the case of wifi I have 3 issues with this line of thought.
>>
>> multicast in wifi has generally supposed to be unreliable. This makes
>> it reliable. reliability comes at a cost -
>>
>> multicast is typically set at a fixed low rate today. unicast is
>> retried at different rates until it succeeds - for every station
>> listening. If one station is already at the lowest rate, the total
>> cost of the transmit increases, rather than decreases.
>>
>> unicast gets block acks until it succeeds. Again, more delay.
>>
>> I think there is something like 31 soft-retries in the ath9k driver
> If I remember correctly, hardware retries are counted here as well.

I chopped this to something more reasonable but never got around to
quantifying it, so never pushed the patch. I figured I'd measure ATF
in a noisy environment (which I'd be doing now if it weren't for
https://bugs.lede-project.org/index.php?do=details&task_id=368 )
first.

>> what happens to diffserv markings here? for unicast CS1 goes into the
>> BE queue, CS6, the VO queue. Do we go from one flat queue for all of
>> multicast to punching it through one of the hardware queues based on
>> the diffserv mark now with this patch?

I meant CS1=BK here. Tracing the path through the bridge code made my
head hurt; I can go look at some aircaps to see whether the
mcast->unicast conversion respects those markings (my vote is *not*).

>> I would like it if there was a way to preserve the unreliability
>> (which multiple mesh protocols depend on), send stuff with QoSNoack,
>> etc - or dynamically choose (based on the rates of the stations)
>> between conventional multicast and unicast.
>>
>> Or - better, IMHO, keep sending multicast as is but pick the best of
>> the rates available to all the listening stations for it.

> The advantage of the multicast-to-unicast conversion goes beyond simply
> selecting a better rate - aggregation matters a lot as well, and that is
> simply incompatible with normal multicast.

Except for the VO queue which cannot aggregate. And for that matter,
using any other hardware queue than BE tends to eat a txop that would
otherwise possibly be combined with an aggregate.

(and the VI queue has always misbehaved, long on my todo list)

> Some multicast streams use lots of small-ish packets, the airtime impact
> of those is vastly reduced, even if the transmission has to be
> duplicated for a few stations.

The question was basically how far up it scales. Arguably, for a
very few, well-connected stations, this patch would help. For a
network with more - and more badly connected - stations, I think it
would hurt.

What sorts of multicast traffic are being observed that flood the
network sufficiently to be worth optimizing out? arp? nd? upnp? mdns?
uftp? tv?

(my questions above are basically related to trying to set up a sane
a/b test; I've been building up a new testbed in a noisy environment to
match the one I have in a quiet one, and don't have any "good" mcast
tests defined. Has anyone done an a/b test of this code with some
repeatable test already?)

(In my observations... the only truly heavy creators of a multicast
"burp" have tended to be upnp and mdns on smaller networks. Things like
nd and arp also get more problematic as the number of stations goes up.
I can try things like abusing vlc or uftp to see what happens?)

I certainly agree multicast is a "problem" (I've seen 20-80% or more
of a given wifi network eaten by multicast), but I'm not convinced that
making it reliable, aggregatable unicast scales much past
basement-level testing of a few "good" stations, and I don't know which
protocols make it worse, or the worst, in typical environments.
Certainly apple gear puts out a lot of multicast.

...

As best as I recall, a recommendation in the 802.11-2012 standard was
that multicast packets be rate-limited, so that you'd have a fixed
amount of crap after each beacon sufficient to keep the rest of the
unicast traffic flowing rapidly, instead of dumping everything into a
given beacon transmit.

That, combined with (maybe) picking the "best" union of known rates
per station, was essentially the strategy I'd intended[1] to pursue
for tackling the currently infinite wifi multicast queue - fq the
entries, have a fairly short queue (codel is not the best choice here)
drop from head, and limit the number of packets transmitted per beacon
to spread them out. That would solve the issue for sparse multicast
(dhcp etc), and smooth out the burps from bigger chunks while
impacting conventional unicast minimally.
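
(A rough sketch of that strategy - a short head-drop multicast FIFO
with a fixed per-beacon release budget. All names and constants here
are hypothetical, not from any driver; callers would hold the queue
lock, since the __skb_* variants are unlocked:)

    /* Keep the multicast queue short: drop from head so the
     * freshest traffic survives. */
    static void mcast_enqueue(struct sk_buff_head *q, struct sk_buff *skb)
    {
        if (skb_queue_len(q) >= 32)
            kfree_skb(__skb_dequeue(q));
        __skb_queue_tail(q, skb);
    }

    /* After each beacon, release at most a fixed budget of
     * packets so unicast traffic keeps flowing in between. */
    static void mcast_release_after_beacon(struct sk_buff_head *q)
    {
        int budget = 8;
        struct sk_buff *skb;

        while (budget-- && (skb = __skb_dequeue(q)))
            mcast_xmit(skb);    /* hypothetical driver hook */
    }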

There's also the pursuit of less multicast overall at least in some protocols

https://tools.ietf.org/html/draft-ietf-dnssd-hybrid-05


>
> - Felix


[1] but make-wifi-fast has been out of funding since august

-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org


Re: [PATCH net-next] bridge: multicast to unicast

2017-01-10 Thread Dave Taht
In the case of wifi I have 3 issues with this line of thought.

multicast in wifi has generally been supposed to be unreliable. This
makes it reliable, and reliability comes at a cost -

multicast is typically set at a fixed low rate today. unicast is
retried at different rates until it succeeds - for every station
listening. If one station is already at the lowest rate, the total
cost of the transmit increases, rather than decreases.

unicast gets block acks until it succeeds. Again, more delay.

I think there is something like 31 soft-retries in the ath9k driver

what happens to diffserv markings here? for unicast CS1 goes into the
BE queue, CS6, the VO queue. Do we go from one flat queue for all of
multicast to punching it through one of the hardware queues based on
the diffserv mark now with this patch?

I would like it if there was a way to preserve the unreliability
(which multiple mesh protocols depend on), send stuff with QoSNoack,
etc - or dynamically choose (based on the rates of the stations)
between conventional multicast and unicast.

Or - better, IMHO, keep sending multicast as is but pick the best of
the rates available to all the listening stations for it.

Has anyone actually looked at the effects of this with, say, 5-10
stations at middlin to poor quality (longer distance)? using something
to measure the real effect of the multicast conversion? (uftp, mdns?)


Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-07 Thread Dave Taht
The openwrt tree has long contained a set of patches that correct for
unaligned issues throughout the linux network stack.

https://git.lede-project.org/?p=openwrt/source.git;a=blob;f=target/linux/ar71xx/patches-4.4/910-unaligned_access_hacks.patch;h=b4b749e4b9c02a74a9f712a2740d63e554de5c64;hb=ee53a240ac902dc83209008a2671e7fdcf55957a

Unaligned access traps in the packet processing path on certain versions of
the mips architecture are horrifically bad. I had kind of hoped these
patches in some form would have made it upstream by now (or that the
arches that have the issue would be retired; I think it's mostly just mips24k).
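
(A minimal sketch of the flavor of fix those patches apply, using the
kernel's unaligned-access helpers; the field choice is just
illustrative:)

    #include <linux/ip.h>
    #include <linux/skbuff.h>
    #include <asm/unaligned.h>

    /* On the affected MIPS cores, dereferencing a 32-bit IP header
     * field at a non-4-byte-aligned address traps and gets fixed up
     * in software, which is very slow; get_unaligned() avoids the
     * trap entirely. */
    static __be32 read_saddr(const struct sk_buff *skb)
    {
        return get_unaligned(&ip_hdr(skb)->saddr);
    }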


Re: iwlwifi: mvm: add reorder buffer per queue

2016-05-16 Thread Dave Taht
I can't even describe how much I hate the concept of the reorder
buffer in general. Ordering is the endpoints' problem.

Someday, after we get fq_codel'ed short queues again, I'll be able to show why.

On Mon, May 16, 2016 at 4:41 AM, Luca Coelho  wrote:
> On Fri, 2016-05-13 at 11:54 +0300, Dan Carpenter wrote:
>> Hello Sara Sharon,
>>
>> The patch b915c10174fb: "iwlwifi: mvm: add reorder buffer per queue"
>> from Mar 23, 2016, leads to the following static checker warnings:
>>
>>   drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:912
>> iwl_mvm_rx_mpdu_mq()
>>   error: potential NULL dereference 'sta'.
>>
>>   drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:912
>> iwl_mvm_rx_mpdu_mq()
>>   error: we previously assumed 'sta' could be null (see line 796)
>
> Thanks for the analysis and report, Dan!
>
> I have queued a fix for this through our internal tree.
>
> --
> Cheers,
> Luca.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org


Re: OpenWRT wrong adjustment of fq_codel defaults (Was: [Codel] fq_codel_drop vs a udp flood)

2016-05-16 Thread Dave Taht
On Mon, May 16, 2016 at 1:14 AM, Roman Yeryomin  wrote:
> On 16 May 2016 at 01:34, Roman Yeryomin  wrote:
>> On 6 May 2016 at 22:43, Dave Taht  wrote:
>>> On Fri, May 6, 2016 at 11:56 AM, Roman Yeryomin  
>>> wrote:
>>>> On 6 May 2016 at 21:43, Roman Yeryomin  wrote:
>>>>> On 6 May 2016 at 15:47, Jesper Dangaard Brouer  wrote:
>>>>>>
>>>>>> I've created a OpenWRT ticket[1] on this issue, as it seems that 
>>>>>> someone[2]
>>>>>> closed Felix'es OpenWRT email account (bad choice! emails bouncing).
>>>>>> Sounds like OpenWRT and the LEDE https://www.lede-project.org/ project
>>>>>> is in some kind of conflict.
>>>>>>
>>>>>> OpenWRT ticket [1] https://dev.openwrt.org/ticket/22349
>>>>>>
>>>>>> [2] 
>>>>>> http://thread.gmane.org/gmane.comp.embedded.openwrt.devel/40298/focus=40335
>>>>>
>>>>> OK, so, after porting the patch to 4.1 openwrt kernel and playing a
>>>>> bit with fq_codel limits I was able to get 420Mbps UDP like this:
>>>>> tc qdisc replace dev wlan0 parent :1 fq_codel flows 16 limit 256
>>>>
>>>> Forgot to mention, I've reduced drop_batch_size down to 32
>>>
>>> 0) Not clear to me if that's the right line, there are 4 wifi queues,
>>> and the third one
>>> is the BE queue.
>>
>> That was an example, sorry, should have stated that. I've applied same
>> settings to all 4 queues.
>>
>>> That is too low a limit, also, for normal use. And:
>>> for the purpose of this particular UDP test, flows 16 is ok, but not
>>> ideal.
>>
>> I played with different combinations, it doesn't make any
>> (significant) difference: 20-30Mbps, not more.
>> What numbers would you propose?
>>
>>> 1) What's the tcp number (with a simultaneous ping) with this latest 
>>> patchset?
>>> (I care about tcp performance a lot more than udp floods - surviving a
>>> udp flood yes, performance, no)
>>
>> During the test (both TCP and UDP) it's roughly 5ms in average, not
>> running tests ~2ms. Actually I'm now wondering if target is working at
>> all, because I had same result with target 80ms..
>> So, yes, latency is good, but performance is poor.
>>
>>> before/after?
>>>
>>> tc -s qdisc show dev wlan0 during/after results?
>>
>> during the test:
>>
>> qdisc mq 0: root
>>  Sent 1600496000 bytes 1057194 pkt (dropped 1421568, overlimits 0 requeues 
>> 17)
>>  backlog 1545794b 1021p requeues 17
>> qdisc fq_codel 8001: parent :1 limit 1024p flows 16 quantum 1514
>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>  backlog 0b 0p requeues 0
>>   maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>>   new_flows_len 0 old_flows_len 0
>> qdisc fq_codel 8002: parent :2 limit 1024p flows 16 quantum 1514
>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>  backlog 0b 0p requeues 0
>>   maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>>   new_flows_len 0 old_flows_len 0
>> qdisc fq_codel 8003: parent :3 limit 1024p flows 16 quantum 1514
>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>  Sent 1601271168 bytes 1057706 pkt (dropped 1422304, overlimits 0 requeues 
>> 17)
>>  backlog 1541252b 1018p requeues 17
>>   maxpacket 1514 drop_overlimit 1422304 new_flow_count 35 ecn_mark 0
>>   new_flows_len 0 old_flows_len 1
>> qdisc fq_codel 8004: parent :4 limit 1024p flows 16 quantum 1514
>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>  backlog 0b 0p requeues 0
>>   maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>>   new_flows_len 0 old_flows_len 0
>>
>>
>> after the test (60sec):
>>
>> qdisc mq 0: root
>>  Sent 3084996052 bytes 2037744 pkt (dropped 2770176, overlimits 0 requeues 
>> 28)
>>  backlog 0b 0p requeues 28
>> qdisc fq_codel 8001: parent :1 limit 1024p flows 16 quantum 1514
>> target 80.0ms ce_threshold 32us interval 100.0ms ecn
>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>  backlog 0b 0p requeues 0
>>   maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
>>   new_flows_len 0 old_flows_len 0
>> qdisc fq_codel 8002: pare

Re: OpenWRT wrong adjustment of fq_codel defaults (Was: [Codel] fq_codel_drop vs a udp flood)

2016-05-06 Thread Dave Taht
On Fri, May 6, 2016 at 11:56 AM, Roman Yeryomin  wrote:
> On 6 May 2016 at 21:43, Roman Yeryomin  wrote:
>> On 6 May 2016 at 15:47, Jesper Dangaard Brouer  wrote:
>>>
>>> I've created a OpenWRT ticket[1] on this issue, as it seems that someone[2]
>>> closed Felix'es OpenWRT email account (bad choice! emails bouncing).
>>> Sounds like OpenWRT and the LEDE https://www.lede-project.org/ project
>>> is in some kind of conflict.
>>>
>>> OpenWRT ticket [1] https://dev.openwrt.org/ticket/22349
>>>
>>> [2] 
>>> http://thread.gmane.org/gmane.comp.embedded.openwrt.devel/40298/focus=40335
>>
>> OK, so, after porting the patch to 4.1 openwrt kernel and playing a
>> bit with fq_codel limits I was able to get 420Mbps UDP like this:
>> tc qdisc replace dev wlan0 parent :1 fq_codel flows 16 limit 256
>
> Forgot to mention, I've reduced drop_batch_size down to 32

0) Not clear to me if that's the right line, there are 4 wifi queues,
and the third one
is the BE queue. That is too low a limit, also, for normal use. And:
for the purpose of this particular UDP test, flows 16 is ok, but not
ideal.

1) What's the tcp number (with a simultaneous ping) with this latest patchset?
(I care about tcp performance a lot more than udp floods - surviving a
udp flood yes, performance, no)

before/after?

tc -s qdisc show dev wlan0 during/after results?

IF you are doing builds for the archer c7v2, I can join in on this... (?)

I did do a test of the ath10k "before": fq_codel *never engaged*, and
tcp induced latencies under load at 100mbit cracked 600ms, while
staying flat (20ms) at 100mbit (not the same patches you are testing),
on x86. I have got tcp 300Mbit out of an osx box, with similar latency;
I have yet to get anything more on anything I currently have
before/after patchsets.

I'll go add flooding to the tests; I just finished a series comparing
two different-speed stations, and life was good on that.

"before" - fq_codel never engages, we see seconds of latency under load.

root@apu2:~# tc -s qdisc show dev wlp4s0
qdisc mq 0: root
 Sent 8570563893 bytes 6326983 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 0: parent :1 limit 10240p flows 1024 quantum 1514
target 5.0ms interval 100.0ms ecn
 Sent 2262 bytes 17 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 0: parent :2 limit 10240p flows 1024 quantum 1514
target 5.0ms interval 100.0ms ecn
 Sent 220486569 bytes 152058 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 18168 drop_overlimit 0 new_flow_count 1 ecn_mark 0
  new_flows_len 0 old_flows_len 1
qdisc fq_codel 0: parent :3 limit 10240p flows 1024 quantum 1514
target 5.0ms interval 100.0ms ecn
 Sent 8340546509 bytes 6163431 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 68130 drop_overlimit 0 new_flow_count 120050 ecn_mark 0
  new_flows_len 1 old_flows_len 3
qdisc fq_codel 0: parent :4 limit 10240p flows 1024 quantum 1514
target 5.0ms interval 100.0ms ecn
 Sent 9528553 bytes 11477 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 66 drop_overlimit 0 new_flow_count 1 ecn_mark 0
  new_flows_len 1 old_flows_len 0


>> This is certainly better than 30Mbps but still more than two times
>> less than before (900).

The number that I still am not sure we got: were you sending
900mbit udp and receiving 900mbit on the prior tests?

>> TCP also improved a little (550 to ~590).

The limit is probably a bit low, also.  You might want to try target
20ms as well.
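
Something like this, per queue (keeping the parent handle from your
example, with the limit back up toward the stock 1024 and an explicit
target):

tc qdisc replace dev wlan0 parent :1 fq_codel flows 16 limit 1024 target 20ms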

>>
>> Felix, others, do you want to see the ported patch, maybe I did something 
>> wrong?
>> Doesn't look like it will save ath10k from performance regression.

what was tcp "before"? (I'm sorry, such a long thread)

>>
>>>
>>> On Fri, 6 May 2016 11:42:43 +0200
>>> Jesper Dangaard Brouer  wrote:
>>>
 Hi Felix,

 This is an important fix for OpenWRT, please read!

 OpenWRT changed the default fq_codel sch->limit from 10240 to 1024,
 without also adjusting q->flows_cnt.  Eric explains below that you must
 also adjust the buckets (q->flows_cnt) for this not to break. (Just
 adjust it to 128)

 Problematic OpenWRT commit in question:
  http://git.openwrt.org/?p=openwrt.git;a=patch;h=12cd6578084e
  12cd6578084e ("kernel: revert fq_codel quantum override to prevent it 
 from causing too much cpu load with higher speed (#21326)")


 I also highly recommend you cherry-pick this very recent commit:
  net-next: 9d18562a2278 ("fq_codel: add batch ability to fq_codel_drop()")
  https://git.kernel.org/davem/net-next/c/9d18562a227

 This should fix very high CPU usage in-case fq_codel goes into drop mode.
 The problem is that drop mode was considered rare, and implementation
 wise it was chosen to be more expensive (to save cycles on normal mode).
 U

Re: [PATCH net-next] fq_codel: add batch ability to fq_codel_drop()

2016-05-02 Thread Dave Taht
I have duplicated the issue on my own hardware. I would also like to
explore upping the codel count in this scenario at some point, but:

Acked-by: Dave Taht


Re: [RFCv2 0/3] mac80211: implement fq codel

2016-03-19 Thread Dave Taht
On Thu, Mar 17, 2016 at 1:55 AM, Michal Kazior  wrote:

> I suspect the BK/BE latency difference has to do with the fact that
> there's bulk traffic going on BE queues (this isn't reflected
> explicitly in the plots). The `bursts` flent test includes short
> bursts of traffic on tid0 (BE) which is shared with ICMP and BE UDP_RR
> (seen as green and blue lines on the plot). Due to (intended) limited
> outflow (6mbps) BE queues build up and don't drain for the duration of
> the entire test creating more opportunities for aggregating BE traffic
> while other queues are near-empty and very short (time wise as well).

I agree with your explanation. Access to the media and queue length
are the two variables at play here.

I just committed a new flent test, "bursts_11e", that should exercise
the vo, vi, be, and bk queues. I dropped the conventional ping from it
and just rely on netperf's udp_rr for each queue. It seems to "do the
right thing" on the ath9k.

And while I'm all in favor of getting 802.11e's behaviors more right,
and this seems like a good way to get there...

netperf's udp_rr is not how most traffic conventionally behaves. It
doesn't do tcp slow start or congestion control in particular...

In the case of the VO queue, for example, the (2004) intended behavior
was 1 isochronous packet per 10ms per voice sending station and one
from the ap, not a "ping". And at the time, VI was intended to be
unicast video. TCP was an afterthought. (wifi's original (1993) mac
was actually designed for ipx/spx!)

I long for regular "rrul" and "rrul_be" tests against the new stuff to
blow it up thoroughly, as references along the way
(tcp_upload, tcp_download, and several of the rtt_fair tests
between stations). I will get formal about it here as soon as we end up
on the same kernel trees.

Furthermore, 802.11e is not widely used - in particular, not much
internet-bound/sourced traffic presently falls into more than BE and
BK. And in some cases it is weirder - comcast remarks a very large
percentage of to-the-home inbound traffic as CS1 (BK), btw, and
stations tend to use CS0. Data comes in on BK, acks go out on BE.

I/we will try to come up with intermediate tests between the burst
tests and the rrul tests as we go along the way.

> If you consider Wi-Fi is half-duplex and latency in the entire stack

In the context of this test regime...


Saying wifi is "half"-duplex is a misleading way to think about it in
many respects. It is a shared medium, more like early, non-switched
ethernet, with a weird mac that governs what sort of packets get
access to the medium (a txop) first, across all stations co-operating
within EDCA.

Half or full duplex is something that mostly applied to p2p serial
connections (or p2p wifi), not P2MP. Additionally characteristics like
exponential backoff make no sense were wifi any form of duplex, full
or half.

Certainly much stuff within a txop (block acks for example) can be
considered half duplex in a microcosmic context.

I wish we had words that accurately described wifi's actual behavior.


> (for processing ICMP and UDP_RR) is greater than 11e contention window
> timings you can get your BE flow responses with extra delay (since
> other queues might have responses ready quicker).

Yes. Always having a request pending for each of the 802.11e queues is
actually not the best idea; it is better to take advantage of the better
aggregation afforded by 802.11n/ac, to have only one or two of the
queues in use against any given station, and to promote or demote traffic
into a more-right queue.

A simple example of the damage of having all 4 queues always contending
can be seen by running the rrul and rrul_be tests against nearly any
given AP.

>
> I've modified traffic-gen and re-run tests with bursts on all tested
> tids/ACs (tid0, tid1, tid5). I'm attaching the results.
>
> With bursts on all tids you can clearly see BK has much higher latency than 
> BE.

The long-term goal here, of course, is for BK (or the other queues) to
not have seconds of queuing latency, but something bounded closer to 2x
the media access time...

> (Note, I've changed my AP to QCA988X with oldie firmware 10.1.467 for
> this test; it doesn't have the weird hiccups I was seeing on QCA99X0
> and newer QCA988X firmware reports bogus expected throughput which is
> most likely a result of my sloppy proof-of-concept change in ath10k).

So I should avoid Ben Greear's firmware for now?

>
>
> Michał
>
> On 16 March 2016 at 20:48, Jasmine Strong  wrote:
>> BK usually has 0 txop, so it doesn't do aggregation.
>>
>> On Wed, Mar 16, 2016 at 11:55 AM, Bob Copeland  wrote:
>>>
>>> On Wed, Mar 16, 2016 at 11:36:31AM -0700, Dave Taht wrote:
>>> > That is the sanest 802.1

Re: [RFCv2 0/3] mac80211: implement fq codel

2016-03-19 Thread Dave Taht
It is helpful to name the test files coherently in the flent tests, in
addition to using a directory structure and timestamp. It makes doing
comparison plots via data->add-other-open-data-files simpler: "-t
patched-mac-300mbps", for example.

Also, netperf from svn (maybe 2.7, I don't remember) will restart udp_rr
within 250ms after a packet loss. Seeing a loss on UDP_RR and having it
stop for a while is "ok".
Dave Täht
Let's go make home routers and wifi faster! With better software!
https://www.gofundme.com/savewifi


On Wed, Mar 16, 2016 at 3:26 AM, Michal Kazior  wrote:
> On 16 March 2016 at 11:17, Michal Kazior  wrote:
>> Hi,
>>
>> Most notable changes:
> [...]
>>  * ath10k proof-of-concept that uses the new tx
>>scheduling (will post results in separate
>>email)
>
> I'm attaching a bunch of tests I've done using flent. They are all
> "burst" tests with burst-ports=1 and burst-length=2. The testing
> topology is:
>
>AP > STA
>AP )) (( STA
>  [veth]--[br]--[wlan] )) (( [wlan]
>
> You can notice that in some tests plot data gets cut-off. There are 2
> problems I've identified:
>  - excess drops (not a problem with the patchset and can be seen when
> there's no codel-in-mac or scheduling isn't used)
>  - UDP_RR hangs (apparently QCA99X0 I have hangs for a few hundred ms
> sometimes at times and doesn't Rx frames causing UDP_RR to stop
> mid-way; confirmed with logs and sniffer; I haven't figured out *why*
> exactly, could be some hw/fw quirk)
>
> Let me know if you have questions or comments regarding my testing/results.
>
>
> Michał


Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-03-10 Thread Dave Taht
>> regular fq_codel uses 1024 and there has not been much reason to
>> change it. In the case of an AP which has more limited memory, 256 or
>> 1024 would be a good setting, per station. I'd stick to 1024 for now.
>
> Do note that the 4096 is shared _across_ station-tid queues. It is not
> per-station. If you have 10 stations you still have 4096 flows
> (actually 4096 + 16*10, because each tid - and there are 16 - has it's
> own fallback flow in case of hash collision on the global flowmap to
> maintain per-sta-tid queuing).

I have to admit I didn't parse this well - still haven't; I think I
need to draw. (got a picture?)

Where is this part happening in the code (or firmware?)

" because each tid - and there are 16 - has it's
 own fallback flow in case of hash collision on the global flowmap to
 maintain per-sta-tid queuing"

"fallback flow - hash collision on global flowmap" - huh?

> With that in mind do you still think 1024 is enough?

Can't answer that question without understanding what you said above.

I assembled a few of the patches to date (your fq_codel patch, Avery's
and Tim's ath9k stuff) and tested them, to no measurable effect,
against Linus's tree a day or two back. I also acquired an ath10k card
- would one of these suit?


Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-03-07 Thread Dave Taht
sta->sta.txq[tid];
> +   struct ieee80211_sub_if_data *sdata;
> +   struct ieee80211_fq *fq;
> struct txq_info *txqi;
>
> if (!txq)
> return;
>
> txqi = to_txq_info(txq);
> +   sdata = vif_to_sdata(txq->vif);
> +   fq = &sdata->local->fq;
>
> /* Lock here to protect against further seqno updates on dequeue */
> -   spin_lock_bh(&txqi->queue.lock);
> +   spin_lock_bh(&fq->lock);
> set_bit(IEEE80211_TXQ_STOP, &txqi->flags);
> -   spin_unlock_bh(&txqi->queue.lock);
> +   spin_unlock_bh(&fq->lock);
>  }
>
>  static void
> diff --git a/net/mac80211/codel.h b/net/mac80211/codel.h
> new file mode 100644
> index ..f6f1b9b73a9a
> --- /dev/null
> +++ b/net/mac80211/codel.h
> @@ -0,0 +1,260 @@
> +#ifndef __NET_MAC80211_CODEL_H
> +#define __NET_MAC80211_CODEL_H
> +
> +/*
> + * Codel - The Controlled-Delay Active Queue Management algorithm
> + *
> + *  Copyright (C) 2011-2012 Kathleen Nichols 
> + *  Copyright (C) 2011-2012 Van Jacobson 
> + *  Copyright (C) 2016 Michael D. Taht 
> + *  Copyright (C) 2012 Eric Dumazet 
> + *  Copyright (C) 2015 Jonathan Morton 
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *notice, this list of conditions, and the following disclaimer,
> + *without modification.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *notice, this list of conditions and the following disclaimer in the
> + *documentation and/or other materials provided with the distribution.
> + * 3. The names of the authors may not be used to endorse or promote products
> + *derived from this software without specific prior written permission.
> + *
> + * Alternatively, provided that this notice is retained in full, this
> + * software may be distributed under the terms of the GNU General
> + * Public License ("GPL") version 2, in which case the provisions of the
> + * GPL apply INSTEAD OF those given above.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> + * DAMAGE.
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "codel_i.h"
> +
> +/* Controlling Queue Delay (CoDel) algorithm
> + * =
> + * Source : Kathleen Nichols and Van Jacobson
> + * http://queue.acm.org/detail.cfm?id=2209336
> + *
> + * Implemented on linux by Dave Taht and Eric Dumazet
> + */
> +
> +/* CoDel5 uses a real clock, unlike codel */
> +
> +static inline codel_time_t codel_get_time(void)
> +{
> +   return ktime_get_ns();
> +}
> +
> +static inline u32 codel_time_to_us(codel_time_t val)
> +{
> +   do_div(val, NSEC_PER_USEC);
> +   return (u32)val;
> +}
> +
> +/* sizeof_in_bits(rec_inv_sqrt) */
> +#define REC_INV_SQRT_BITS (8 * sizeof(u16))
> +/* needed shift to get a Q0.32 number from rec_inv_sqrt */
> +#define REC_INV_SQRT_SHIFT (32 - REC_INV_SQRT_BITS)
> +
> +/* Newton approximation method needs more iterations at small inputs,
> + * so cache them.
> + */
> +
> +static void codel_vars_init(struct codel_vars *vars)
> +{
> +   memset(vars, 0, sizeof(*vars));
> +}
> +
> +/*
> + * 
> http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Iterative_methods_for_reciprocal_square_roots
> + * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
> + *
> + * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
> + */
> +static inline void codel_Newton_step(struct codel_vars *vars)
> +{
> +   u32 invsqrt = ((u32)vars->rec_inv_sqrt) << REC_INV_SQRT_

Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-03-07 Thread Dave Taht
On Mon, Mar 7, 2016 at 9:14 AM, Avery Pennarun  wrote:
> On Mon, Mar 7, 2016 at 11:54 AM, Dave Taht  wrote:
>> If I can just get a coherent patch set that I can build, I will gladly
>> join you on the testing ath9k in particular... can probably do ath10k,
>> too - and do a bit of code review... this week. it's very exciting
>> watching all this activity...
>>
>> but I confess that I've totally lost track of what set of trees and
>> patchwork I should try to pull from. wireless-drivers-next? ath10k?
>> wireless-next? net-next? toke and I have a ton of x86 platforms
>> available to test on
>>
>> Avery - which patches did you use??? on top of what?
>
> The patch series I'm currently using can be found here:
>
>   git fetch https://gfiber.googlesource.com/vendor/opensource/backports
> ath9k_txq+fq_codel

No common commits, but ok, thx for a buildable-looking tree.

d@dancer:~/git/linux$ git clone -b ath9k_txq+fq_codel --reference
net-next https://gfiber.googlesource.com/vendor/opensource/backports
Cloning into 'backports'...
warning: no common commits
remote: Sending approximately 30.48 MiB ...
remote: Counting objects: 4758, done
remote: Finding sources: 100% (5/5)
remote: Total 19312 (delta 12999), reused 19308 (delta 12999)
Receiving objects: 100% (19312/19312), 30.48 MiB | 6.23 MiB/s, done.
Resolving deltas: 100% (12999/12999), done.


>
> That's again backports-20160122, which comes from linux-next as of
> 20160122.  You can either build backports against whatever kernel
> you're using (probably easiest) or try to use that version of
> linux-next, or rebase the patches onto your favourite kernel.
>
>> In terms of "smoothing" codel...
>>
>> I emphatically do not think codel in it's current form is "ready" for
>> wireless, at the very least the target should not be much lower than
>> 20ms in your 2 station tests.  There is another bit in codel where the
>> algo "turns off" with only a single MTU's worth of packets
>> outstanding, which could get bumped to the ideal size of the
>> aggregate. "ideal" kind of being a variable based on a ton of other
>> factors...
>
> Yeah, I figured that sort of thing would come up.  I'm feeling forward
> progress just by finally seeing the buggy oscillations finally happen,
> though. :)

It's *very* exciting to see y'all break things in a measurable, yet
positive direction.

>
>> the underlying code needs to be striving successfully for per-station
>> airtime fairness for this to work at all, and the driver/card
>> interface nearly as tight as BQL is for the fq portion to behave
>> sanely. I'd configure codel at a higher target and try to observe what
>> is going on at the fq level til that got saner.
>
> That seems like two good goals.  So Emmanuel's BQL-like thing seems
> like we'll need it soon.
>
> As for per-station airtime fairness, what's a good approximation of
> that?  Perhaps round-robin between stations, one aggregate per turn,
> where each aggregate has a maximum allowed latency?

Strict round robin is a start, and simplest, yes. Sure.

"Oldest station queues first" on a round (probably) has higher
potential for maximizing txops, but requires more overhead. (shortest
queue first would be bad). There's another algo based on last received
packets from a station possibly worth fiddling with in the long run...

as "maximum allowed latency" - well, to me that is eventually also a
variable, based on the number of stations that have to be scheduled on
that round. Trying to get away from 10 stations eating 5.7ms each +
return traffic on a round would be nicer. If you want a constant, for
now, aim for 2048us (2 TU).
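
A rough sketch of the strict round robin case (kernel list primitives,
but every other name here is hypothetical):

/* serve each backlogged station once per round, one aggregate per turn */
struct sta_q {
	struct list_head rr_node;	/* linkage on the active list */
	struct sk_buff_head queue;	/* per-station packet queue */
};

static LIST_HEAD(active_stations);

static void rr_one_round(u32 max_txop_us)
{
	LIST_HEAD(round);
	struct sta_q *sta, *tmp;

	list_splice_init(&active_stations, &round);
	list_for_each_entry_safe(sta, tmp, &round, rr_node) {
		/* build + send at most one aggregate, capped at max_txop_us */
		tx_one_aggregate(sta, max_txop_us);	/* hypothetical */

		if (skb_queue_empty(&sta->queue))
			list_del_init(&sta->rr_node);	/* drained */
		else
			list_move_tail(&sta->rr_node, &active_stations);
	}
}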

> I don't know how
> the current code works, but it's probably almost like that, as long as
> we only put one aggregate's worth of stuff into each hwq, which I
> guess is what the BQL-like thing will do.

I would avoid trying to think about or using 802.11e's 4 queues at the
moment[1]. We also have fallout from mu-mimo to deal with eventually,
but gang scheduling starts to fall out naturally from these
structures and methods...

>
> So if I understand correctly, what we need is, in the following order:
> 1) Disable fq_codel for now, and get BQL-like thing working in ath9k
> (and ensure we're getting airtime fairness even without fq_codel);
> 2) Re-enable fq_codel and increase fq_codel's target up to 20ms for now;
> 3) Tweak fq_codel's "turn off" size to be larger (how important is this?)
>
> Is that right?

Sounds good. I have not reviewed the codel5 based implementation; it
may not even have idea "#3" in it.

Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-03-07 Thread Dave Taht
If I can just get a coherent patch set that I can build, I will gladly
join you on the testing ath9k in particular... can probably do ath10k,
too - and do a bit of code review... this week. it's very exciting
watching all this activity...

but I confess that I've totally lost track of what set of trees and
patchwork I should try to pull from. wireless-drivers-next? ath10k?
wireless-next? net-next? toke and I have a ton of x86 platforms
available to test on

Avery - which patches did you use??? on top of what?

In terms of "smoothing" codel...

I emphatically do not think codel in its current form is "ready" for
wireless, at the very least the target should not be much lower than
20ms in your 2 station tests.  There is another bit in codel where the
algo "turns off" with only a single MTU's worth of packets
outstanding, which could get bumped to the ideal size of the
aggregate. "ideal" kind of being a variable based on a ton of other
factors...

the underlying code needs to be striving successfully for per-station
airtime fairness for this to work at all, and the driver/card
interface nearly as tight as BQL is for the fq portion to behave
sanely. I'd configure codel at a higher target and try to observe what
is going on at the fq level til that got saner.

There are so many other variables and things left unaccounted for, as yet.


Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-03-03 Thread Dave Taht
On Tue, Mar 1, 2016 at 11:38 PM, Michal Kazior  wrote:
> On 1 March 2016 at 15:02, Johannes Berg  wrote:
>> On Fri, 2016-02-26 at 14:09 +0100, Michal Kazior wrote:
>>>
>>> +typedef u64 codel_time_t;
>>
>> Do we really need this? And if yes, does it have to be in the public
>> header file? Why a typedef anyway?
>
> Hmm.. I don't think it's strictly necessary. I just wanted to keep as
> much from the original codel implementation as possible. I'm fine with
> using just u64.

This is an artifact of the original codel keeping time in (nsec >> 10)
to fit into a 32 bit int.

In codel5 we switched to native 64 bit timekeeping: it is simpler,
improves logging, and is easier to reason about.

u64 is fine.
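
For reference, the original representation looks roughly like this
(close to what is in include/net/codel.h):

/* time in units of (ns >> 10), ~1.024us, squeezed into 32 bits */
typedef u32 codel_time_t;
#define CODEL_SHIFT 10

static codel_time_t codel_get_time(void)
{
	u64 ns = ktime_get_ns();

	return ns >> CODEL_SHIFT;	/* wraps roughly every 73 minutes */
}

/* wraparound-safe comparison, simplified (the real one typechecks) */
#define codel_time_before(a, b)	((s32)((a) - (b)) < 0)

/* codel5 instead just uses u64 nanoseconds from ktime_get_ns() directly */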

>
>>> - * @txq_ac_max_pending: maximum number of frames per AC pending in
>>> all txq
>>> - *   entries for a vif.
>>> + * @txq_cparams: codel parameters to control tx queueing dropping
>>> behavior
>>> + * @txq_limit: maximum number of frames queuesd
>>
>> typo - queued
>>
>>> @@ -2133,7 +2155,8 @@ struct ieee80211_hw {
>>>   u8 uapsd_max_sp_len;
>>>   u8 n_cipher_schemes;
>>>   const struct ieee80211_cipher_scheme *cipher_schemes;
>>> - int txq_ac_max_pending;
>>> + struct codel_params txq_cparams;
>>
>> Should this really be a parameter the driver determines?
>
> I would think so, or at least it should be able to influence it in
> *some* way. You can have varying degrees of induced latency depending
> on fw/hw tx queue depth, air conditions and possible tx rates implying
> different/varying RTT. Cake[1] even has a few RTT presets like: lan,
> internet, satellite.

While those presets have been useful in testing codel (and more
generally in cake, where we can rapidly change the bandwidth from
userspace for testing), in the real world you don't move from orbit to
desktop and back as rapidly as wifi rates change.

> I don't really have a plan how exactly a driver could make use of it
> for benefit though. It might end up not being necessary after all if
> generic tx scheduling materializes in mac80211.

What we envisioned here is ewma smoothing the target based on the
total service time needed for all active stations, per round. (There
are other possible approaches)

Say you serve 10 stations at 1ms each in one round; a codel target of
5ms will try to push things down too far.  If in the next round, you
only serve 2 stations at 1ms each (but get back 10 responses at .5ms
each), you're still too high. If it's just one station, well, you can
get below 2ms if the driver is only sending 1ms, but maybe it's
sending 5ms...

If you have a large multicast burst, that burp will cause lost packets.

Merely getting typical wifi latencies under load down below the 20ms
range would be a good start; after that, some testing, hard thought,
and evaluation are going to be needed. For early testing I think a
20ms fixed target would be safer than the existing 5ms.
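
The sort of smoothing we have in mind, in very rough pseudo-C (all
names made up):

/* called once per scheduling round with the total airtime actually
 * used to serve all active stations that round */
static u64 codel_target_ns = 20 * NSEC_PER_MSEC;	/* start safe */

static void update_target(u64 round_service_ns)
{
	/* ewma, alpha = 1/8, to smooth out per-round variation */
	codel_target_ns = (7 * codel_target_ns + round_service_ns) / 8;

	/* don't chase below a couple of txops' worth of service time */
	codel_target_ns = max_t(u64, codel_target_ns, 2 * NSEC_PER_MSEC);
}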

Pushing the fq part of fq_codel on a per station basis as close to the
hardware as possible, and having better airtime fairness between
stations is a huge win in itself.

>
> [1]: http://www.bufferbloat.net/projects/codel/wiki/Cake
>
>
>>> +static void ieee80211_if_setup_no_queue(struct net_device *dev)
>>> +{
>>> + ieee80211_if_setup(dev);
>>> + dev->priv_flags |= IFF_NO_QUEUE;
>>> + /* Note for backporters: use dev->tx_queue_len = 0 instead
>>> of IFF_ */
>>
>> Heh. Remove that comment; we have an spatch in backports already :)
>
> I've put it in the RFC specifically in case anyone wants to port this
> bypassing backports, e.g. to openwrt's quilt (so when compilation
> fails, you can quickly fix it up). I'll remove it before proper
> submission obviously :)
>
>
>>> --- a/net/mac80211/sta_info.h
>>> +++ b/net/mac80211/sta_info.h
>>> @@ -19,6 +19,7 @@
>>>  #include 
>>>  #include 
>>>  #include "key.h"
>>> +#include "codel_i.h"
>>>
>>>  /**
>>>   * enum ieee80211_sta_info_flags - Stations flags
>>> @@ -327,6 +328,32 @@ struct mesh_sta {
>>>
>>>  DECLARE_EWMA(signal, 1024, 8)
>>>
>>> +struct txq_info;
>>> +
>>> +/**
>>> + * struct txq_flow - per traffic flow queue
>>> + *
>>> + * This structure is used to distinguish and queue different traffic
>>> flows
>>> + * separately for fair queueing/AQM purposes.
>>> + *
>>> + * @txqi: txq_info structure it is associated at given time
>>> + * @flowchain: can be linked to other flows for RR purposes
>>> + * @backlogchain: can be linked to other flows for backlog sorting
>>> purposes
>>> + * @queue: sk_buff queue
>>> + * @cvars: codel state vars
>>> + * @backlog: number of bytes pending in the queue
>>> + * @deficit: used for fair queueing balancing
>>> + */
>>> +struct txq_flow {
>>> + struct txq_info *txqi;
>>> + struct list_head flowchain;
>>> + struct list_head backlogchain;
>>> + struct sk_buff_head queue;
>>> + struct codel_vars cvars;
>>> + u32 backlog;
>>> + u32 deficit;
>>> +};
>>> +
>>>  /**
>>>   * struct sta_info - STA information
>>>   *
>>> diff --git a/net/mac80211/tx.c b/net/mac80

Re: [PATCH] codel: add forgotten inline to functions in header file

2016-02-11 Thread Dave Taht
On Thu, Feb 11, 2016 at 7:29 AM, Grumbach, Emmanuel
 wrote:
>
>
> On 02/11/2016 05:12 PM, Eric Dumazet wrote:
>> On Thu, 2016-02-11 at 15:05 +, Grumbach, Emmanuel wrote:
>>
>>
>>> Yeah :) codel_should_drop seemed very long indeed... I wanted to use the
>>> codel_get_time and associated utils (_before, _after) in iwlwifi.
>>> They're better than jiffies... So maybe I can just copy that code to
>>> iwlwifi.

Definitely better than jiffies.

>>
>> You certainly can submit a patch adding the inline, but not on all
>> functions present in this file ;)
>>
>> Thanks !
>>
>
> Actually... All I need *has* the inline, but if I include codel.h,
> codel_dequeue is defined but not used and you definitely don't want to
> inline that one. So I guess the only other option I have is to split
> that header file which I don't think is really worth it. So, unless you
> object, I'll just copy the code.

I think it is best to start with another base implementation of codel
for wifi, yes.

What I think is currently the best-performing codel implementation (on
the wire, for ethernet) we have is in:

https://github.com/dtaht/bcake/codel5.h

which has a few differences from eric's implementation: 64 bit
timestamps, inlining, and a different means of injecting the callback.
There has not been a lot of cpu profiling on it - we are still aiming
for algorithmic correctness here, and it is closer to the original
paper...

The one currently in the main cake had issues in the last test round
but has been updated since. (sch_cake is also on github).

In neither case is it the right thing for wifi, either.

the "starting plan" such as it was was to get to "one aggregate in the
hardware, one in the driver, one ready to be formed on the completion
interrupt", and pull a smoothed service time from start to completion
interrupt to dynamically modify the codel target. (other headaches,
like multicast, abound).

(It's the per station queue + fq as close to the hardware as possible
that matters most, IMHO.)
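
Concretely, that plan might look something like this sketch (every
name here is hypothetical, including codel_set_target()):

struct my_drv {
	u64 agg_start_ns;	/* when the in-flight aggregate started */
	u64 svc_ewma_ns;	/* smoothed start-to-completion time */
	int aggs_in_flight;
};

#define MAX_AGGS_IN_FLIGHT 3	/* one in hw, one in driver, one forming */

static void tx_completion_irq(struct my_drv *drv)
{
	u64 svc_ns = ktime_get_ns() - drv->agg_start_ns;

	drv->aggs_in_flight--;
	/* ewma the hw service time, alpha = 1/8 */
	drv->svc_ewma_ns = (7 * drv->svc_ewma_ns + svc_ns) / 8;

	/* steer the codel target from the smoothed service time */
	codel_set_target(2 * drv->svc_ewma_ns);	/* hypothetical setter */

	if (drv->aggs_in_flight < MAX_AGGS_IN_FLIGHT)
		form_next_aggregate(drv);	/* hypothetical */
}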



Re: [PATCH] codel: add forgotten inline to functions in header file

2016-02-11 Thread Dave Taht
On Thu, Feb 11, 2016 at 7:05 AM, Grumbach, Emmanuel
 wrote:
> fixing linux-wireless address ...
>
> On 02/11/2016 04:30 PM, Eric Dumazet wrote:
>> On Thu, 2016-02-11 at 16:08 +0200, Emmanuel Grumbach wrote:
>>> Signed-off-by: Emmanuel Grumbach 
>>> ---
>>> -static bool codel_should_drop(const struct sk_buff *skb,
>>> -  struct Qdisc *sch,
>>> -  struct codel_vars *vars,
>>> -  struct codel_params *params,
>>> -  struct codel_stats *stats,
>>> -  codel_time_t now)
>>> +static inline bool codel_should_drop(const struct sk_buff *skb,
>>> + struct Qdisc *sch,
>>> + struct codel_vars *vars,
>>> + struct codel_params *params,
>>> + struct codel_stats *stats,
>>> + codel_time_t now)
>>
>> The lack of inline was done on purpose.
>>
>> This include file is kind of special, being included by codel and
>> fq_codel.
>>
>> Hint : we do not want to force the compiler to inline
>> codel_should_drop() (or any other function).
>>
>>
>> See this file as if it was a .c really.
>>
>>
>
> Yeah :) codel_should_drop seemed very long indeed... I wanted to use the
> codel_get_time and associated utils (_before, _after) in iwlwifi.
> They're better than jiffies... So maybe I can just copy that code to
> iwlwifi.

I need to stress that codel as-is is not the right thing for wifi,
particularly point-to-multipoint wifi in highly contended scenarios.
It IS a starting point. We have generally felt that the target needs
to be offset against the actual service opportunities, and the effects
of multicast (with powersave) and other "background" frames need to
be smoothed out.

Lacking hardware that can do that, or adequate sims, has stalled
attempts to come up with "the right thing". It looks like you are
putting in place more of the pieces to get there in some tree
somewhere?


Re: [RFC v2] iwlwifi: pcie: transmit queue auto-sizing

2016-02-05 Thread Dave Taht
> A bursted txop can be as big as 5-10ms. If you consider you want to
> queue 5-10ms worth of data for *each* station at any given time you
> obviously introduce a lot of lag. If you have 10 stations you might
> end up with service period at 10*10ms = 100ms. This gets even worse if
> you consider MU-MIMO because you need to do an expensive sounding
> procedure before transmitting. So while SU aggregation can probably
> still work reasonably well with shorter bursts (1-2ms) MU needs at
> least 3ms to get *any* gain when compared to SU (which obviously means
> you want more to actually make MU pay off).

I am not sure where you get these numbers. Got a spreadsheet?

Gradually reducing the maximum txop size as a function of the number
of stations makes sense. If you have 10 stations pending delivery and
reduced the max txop to 1ms, you hurt bandwidth at that instant, but
by offering more service to more stations, in less time, they will
converge on a reasonable share of the bandwidth for each, faster[1].
And I'm sure that the person videoconferencing on a link like that
would appreciate getting some service inside of a 10ms interval,
rather than a 100ms.
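
As a sketch, with made-up numbers and names:

/* shrink the txop cap as the count of stations with pending traffic
 * grows, so one full service round stays short */
static u32 pick_max_txop_us(u32 n_active, u32 round_budget_us)
{
	u32 txop_us;

	if (n_active <= 1)
		return 5700;	/* lone station: allow a full burst */

	txop_us = round_budget_us / n_active;

	/* never below ~1 TU, never above a full 5.7ms burst */
	return clamp(txop_us, 1024U, 5700U);
}

With a 10ms round budget and 10 active stations that yields roughly
1ms apiece.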

yes, there's overhead, and that's not the right number, which would
vary across g, n, ac and successors.

You will also get more opportunities to use mu-mimo with shorter
bursts extant and more stations being regularly serviced.

[1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50

> The rule of thumb is the
> longer you wait, the more capacity you can get.

This is not strictly true, as the "fountain" of packets is regulated by
acks on the other side of the link, and ramps up or down as a function
of service time and loss.

>
> Apparently there's interest in maximizing throughput but it stands in
> direct opposition to keeping the latency down, so I've been thinking
> how to satisfy both.
>
> The current approach ath10k is taking (patches in review [1][2]) is to
> use mac80211 software queues for per-station queuing, exposing queue
> state to firmware (it decides where frames should be dequeued from)
> and making it possible to stop/wake per-station tx subqueue with fake
> netdev queues. I'm starting to think this is not the right way though
> because it's inherently hard to control latency and there's a huge
> memory overhead associated with the fake netdev queues.

What is this overhead?

Applying things like codel tends to dramatically shorten the number of
skbs extant... modern 802.11ac capable hardware has tons more
memory...

> Also fq_codel
> is less effective with this kind of setup.

fq_codel's principal problems with wifi are documented at length in
the talk above.

> My current thinking is that the entire problem should be solved via
> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
> limit/target/interval/quantum knobs to tune it for higher latency of
> aggregation-oriented Wi-Fi links where long service time (think
> 100-200ms) is acceptable. However fq_codel is oblivious to how Wi-Fi
> works in the first place, i.e. Wi-Fi gets better throughput if you
> deliver bursts of packets destined to the same station. Moreover this
> gets even more complicated with MU-MIMO where you may want to consider
> spatial location (which influences signal quality when grouped) of
> each station when you decide which set of stations you're going to
> aggregate to in parallel. Since drivers have a finite tx ring, it
> is important to deliver bursts that can actually be aggregated
> efficiently. This means driver would need to be able to tell qdisc
> about per-flow conditions to influence the RR scheme in some way
> (assuming a qdisc even understands flows; do we need a unified way of
> talking about flows between qdiscs and drivers?).

This is a very good summary of the problems in layering fq_codel as it
exists today on top of wifi as it exists today. :/ Our conclusion
several years ago was that, since the information needed to do things more
right lives in the mac80211 layer, we could not evolve the qdisc
layer to suit, and needed to move the core ideas into the mac80211
layer.

Things have evolved since, but I still think we can't get enough info
up to the qdisc layer (locks and so on) to use it sanely.

>
> [1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
> [2]: https://www.spinics.net/lists/linux-wireless/msg146512.html

I will review!

>
 For reference, ath10k has around 1400 tx descriptors, though
 in practice not all are usable, and in stock firmware, I'm guessing
 the NIC will never be able to actually fill up its tx descriptors
 and stop traffic.  Instead, it just allows the stack to try to
 TX, then drops the frame...
>>>
>>>
>>> 1400 descriptors, ok... but they are not organised in queues?
>>> (forgive my ignorance of athX drivers)
>>
>>
>> I think all the details are in the firmware, at least for now.
>
> Yeah. Basically ath10k has a flat set of tx descriptors which are
> AC-agnostic. 

Re: [RFC RESEND] iwlwifi: pcie: transmit queue auto-sizing

2016-02-04 Thread Dave Taht
o that is in use on the mt72 chipset that felix is working
on... but nowhere else so far as I know (as yet).

the iwl does its own aggregation (I think(?))... but estimates can
still be made...

There are WAY more details of course - per station queuing, a separate
multicast queue, only some of which are in that talk! - but my hope was
that under good conditions we'd get wireless-n down below 12ms of
driver overhead, even at 6mbit, before something like fq_codel could
kick in (under good conditions! Plenty of other potential latency
sources beside excessive queuing in wifi!). My ideal world would be to
hold it at under 1250us at higher rates.

Periodically sampling seems like a reasonable approach under lab
conditions but it would be nicer to have feedback from the firmware -
"I transmitted the last tx as an X byte aggregate, at MCS1, I had to
retransmit a few packets once, it took me 6ms to acquire the media, I
heard 3 other stations transmitting, etc.".
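
Structured, that wish list might look like this (entirely
hypothetical; no firmware exposes exactly this today):

struct tx_feedback {
	u32 agg_bytes;		/* size of the aggregate as sent */
	u8  mcs;		/* rate it actually went out at */
	u8  retries;		/* frames that needed re-transmission */
	u32 media_acq_us;	/* time spent acquiring the media */
	u8  stations_heard;	/* other stations heard transmitting */
};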

The above info we know we can get from a few chipsets, but not enough
was known about the iwl last I looked. And one reason why fq_codel -
unassisted - is not quite the right thing on top of this is that
multicast can take a really long time...

Regardless, I'd love to see/use this patch myself in a variety
of real world conditions and see what happens. And incremental
progress is the only way forward. Thx for cheering me up.

>
> Cc: Stephen Hemminger 
> Cc: Dave Taht 
> Cc: Jonathan Corbet 
> Signed-off-by: Emmanuel Grumbach 
> ---
> Fix Dave's email address
> ---
>  drivers/net/wireless/intel/iwlwifi/pcie/internal.h |  6 
>  drivers/net/wireless/intel/iwlwifi/pcie/tx.c   | 32 
> --
>  2 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h 
> b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
> index 2f95916..d83eb56 100644
> --- a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
> @@ -192,6 +192,11 @@ struct iwl_cmd_meta {
> u32 flags;
>  };
>
> +struct iwl_txq_auto_size {
> +   int min_space;
> +   unsigned long reset_ts;
> +};
> +
>  /*
>   * Generic queue structure
>   *
> @@ -293,6 +298,7 @@ struct iwl_txq {
> bool block;
> unsigned long wd_timeout;
> struct sk_buff_head overflow_q;
> +   struct iwl_txq_auto_size auto_sz;
>  };
>
>  static inline dma_addr_t
> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c 
> b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
> index 837a7d5..4d1dee6 100644
> --- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
> @@ -572,6 +572,8 @@ static int iwl_pcie_txq_init(struct iwl_trans *trans, 
> struct iwl_txq *txq,
>
> spin_lock_init(&txq->lock);
> __skb_queue_head_init(&txq->overflow_q);
> +   txq->auto_sz.min_space = 240;
> +   txq->auto_sz.reset_ts = jiffies;
>
> /*
>  * Tell nic where to find circular buffer of Tx Frame Descriptors for
> @@ -1043,10 +1045,14 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, 
> int txq_id, int ssn,
>  q->read_ptr != tfd_num;
>  q->read_ptr = iwl_queue_inc_wrap(q->read_ptr)) {
> struct sk_buff *skb = txq->entries[txq->q.read_ptr].skb;
> +   struct ieee80211_tx_info *info;
> +   unsigned long tx_time;
>
> if (WARN_ON_ONCE(!skb))
> continue;
>
> +   info = IEEE80211_SKB_CB(skb);
> +
> iwl_pcie_free_tso_page(skb);
>
> __skb_queue_tail(skbs, skb);
> @@ -1056,6 +1062,18 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, 
> int txq_id, int ssn,
> iwl_pcie_txq_inval_byte_cnt_tbl(trans, txq);
>
> iwl_pcie_txq_free_tfd(trans, txq);
> +
> +   tx_time = 
> (uintptr_t)info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2];
> +   if (time_before(jiffies, tx_time + msecs_to_jiffies(10))) {
> +   txq->auto_sz.min_space -= 10;
> +   txq->auto_sz.min_space =
> +   max(txq->auto_sz.min_space, txq->q.high_mark);
> +   } else if (time_after(jiffies,
> + tx_time + msecs_to_jiffies(20))) {
> +   txq->auto_sz.min_space += 10;
> +   txq->auto_sz.min_space =
> +   min(txq->auto_sz.min_space, 252);
> +   }
> }
>
> iwl_pcie_txq_progress(txq);
> @@ -

Re: [RFC PATCH net-next] tcp: add CDG congestion control

2015-06-04 Thread Dave Taht
On Thu, Jun 4, 2015 at 9:19 AM, Kenneth Klette Jonassen
 wrote:
>> why not just call tcp_init_cwnd_reduction()?
>
> I deferred considering the ECN implications of doing so. The code to
> start PRR was based on tcp_enter_cwr()/tcp_init_cwnd_reduction(), save
> that both of these functions ensure a call to tcp_ecn_queue_cwr().
>
> The upcoming patch set instead exports tcp_enter_cwr() and calls it,
> which sets CWR on ECN-capable connections.
>
> Consider CDG at the sender-side. The RFC 3168 Errata says that a
> receiver should:
> "First, reset ECE because of CWR
> Second, set ECE because of CE"
>
> Thus, in "good networks", setting CWR on a CDG backoff should not
> affect ECN signalling – we still get ECE from the receiver.

On the AQM front I have been trying to deal with queue overload in the presence
of ECN, with drop-then-mark behavior. This handles the case of an unresponsive
(lying) flow, and overload in general, on the queue side.

... however it is unclear to me how well various tcps are responding
to the combination of these two signals, and finding a threshold to
switch to drop + mark is hard.

I have certainly had cases where every incoming packet ended up marked
CE, without enough effect, fast enough.
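
For concreteness, the drop-then-mark reaction sketched with made-up
names and a made-up threshold:

/* below the overload point, ECN-capable packets get CE-marked instead
 * of dropped; past it we drop the packet *and* arrange to CE-mark the
 * next one, so a flow that ignores marks still sees loss */
#define OVERLOAD_THRESH_NS	(20 * NSEC_PER_MSEC)

static void aqm_react(struct sk_buff *skb, u64 qdelay_ns, bool *mark_next)
{
	if (qdelay_ns < OVERLOAD_THRESH_NS && skb_is_ect(skb)) {
		mark_ce(skb);			/* hypothetical helpers */
		return;
	}

	kfree_skb(skb);				/* drop... */
	if (qdelay_ns >= OVERLOAD_THRESH_NS)
		*mark_next = true;		/* ...then mark */
}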

> It is conceivable that fringe scenarios of ECN in the forward path and
> losses in the ACK path creates a situation where CDG's backoff could
> inhibit some ECN signals from the receiver. That is, setting CWR will
> inhibit the signalling reliability of standard ECN in TCP. But we are
> arguably still in good faith since we have reduced the congestion
> window by some beta.



-- 
Dave Täht
What will it take to vastly improve wifi for everyone?
https://plus.google.com/u/0/explore/makewififast


  1   2   >