Re: [Bloat] The Dark Problem with AQM in the Internet?

2014-09-01 Thread Richard Scheffenegger
Hi Jerry,

Isn't this the problem statement of ConEx?

Again, you at the end host would gain little insight with ConEx, but every 
intermediate network operator can observe the red/black marked packets, compare 
the ratios (ingress vs. egress of his network), and know to what extent he is 
contributing...

Best regards,
  Richard

  - Original Message - 
  From: Jerry Jongerius 
  To: 'Rich Brown' 
  Cc: bloat@lists.bufferbloat.net 
  Sent: Thursday, August 28, 2014 6:20 PM
  Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?


  It adds accountability.  Everyone in the path right now denies that they could 
possibly be the one dropping the packet.

   

  If I want (or need!) to address the problem, I can't now.  I would have to 
make a change and just hope that it fixed the problem.

   

  With accountability, I can address the problem.  I then have a choice.  If 
the problem is the ISP, I can switch ISP's.  If the problem is the mid-level 
peer or the hosting provider, I can test out new hosting providers.

   

  - Jerry

   

   

   

  From: Rich Brown [mailto:richb.hano...@gmail.com] 
  Sent: Thursday, August 28, 2014 10:39 AM
  To: Jerry Jongerius
  Cc: Greg White; Sebastian Moeller; bloat@lists.bufferbloat.net
  Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?

   

  Hi Jerry,

   

AQM is a great solution for bufferbloat.  End of story.  But if you want to 
track down which device in the network intentionally dropped a packet (when 
many devices in the network path will be running AQM), how are you going to do 
that?  Or how do you propose to do that?

   

  Yes, but... I want to understand why you are looking to know which device 
dropped the packet. What would you do with the information?

   

  The great beauty of fq_codel is that it discards packets that have dwelt too 
long in a queue by actually *measuring* how long they've been in the queue. 
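
  For example, that sojourn-time behaviour is exposed directly as the qdisc's
  knobs; a minimal invocation (the values below are simply fq_codel's defaults,
  not a tuned recommendation):

    # drop only when packets have sat in the queue longer than 'target'
    # for at least one 'interval'
    tc qdisc add dev eth0 root fq_codel target 5ms interval 100ms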

   

  If the drops happen in your local gateway/home router, then it's interesting 
to you as the operator of that device. If the drops happen elsewhere (perhaps 
some enlightened ISP has installed fq_codel, PIE, or some other zoomy queue 
discipline) then they're doing the right thing as well - they're managing their 
traffic as well as they can. But once the data leaves your gateway router, you 
can't make any further predictions.

   

  The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal 
performance of the *local* gateway, to make it adapt to the remainder of the 
(black box) network. It might make sense to instrument the CeroWrt/OpenWrt code 
to track the number of fq_codel drops to come up with a sense of what's 
'normal'. And if you need to know exactly what's happening, then 
tcpdump/wireshark are your friends. 
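
  Even before instrumenting anything, the stock qdisc statistics already give a
  rough running count of drops; for example (substitute whatever your WAN-facing
  interface actually is):

    # per-qdisc counters; look at the "dropped" figure on the fq_codel entry
    tc -s qdisc show dev eth0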

   

  Maybe I'm missing the point of your note, but I'm not sure there's anything 
you can do beyond your gateway. In the broader network, operators are 
continually watching their traffic and drop rates, and adjusting/reconfiguring 
their networks to adapt. But in general, it's impossible for you to have any 
sway/influence on their operations, so I'm not sure what you would do if you 
could know that the third router in traceroute was dropping...

   

  Best regards,

   

  Rich





Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...

2014-09-01 Thread Dave Taht
On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton chromati...@gmail.com wrote:

 On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:

 Could I get you to also try HFSC?

 Once I got a kernel running that included it, and figured out how to make it 
 do what I wanted...

 ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.

If you are feeling really inspired, try cbq. :) One thing I sort of
like about cbq is that it (I think, unlike htb presently) operates off
an estimated size for the next packet (which isn't dynamic, sadly),
where the others buffer up an extra packet until it can be delivered.

In my quest for absolutely minimal latency I'd love to be rid of that
last extra non-in-the-fq_codel-qdisc packet... either with a peek
operation or with a running estimate. I think this would (along with
killing the maxpacket check in codel) allow for a faster system with
less tuning (no tweaks below 2.5mbit in particular) across the entire
operational range of ethernet.

There would also need to be some support for what I call "GRO
slicing", where a large receive is split back into individual packets
when a drop decision needs to be made.

It would be cool to be able to program the ethernet hardware itself to
return completion interrupts at a given transmit rate (so you could
program the hardware to be any bandwidth, not just 10/100/1000). Some
hardware, so far as I know, supports this with a pacing feature.

This doesn't help on inbound rate limiting, unfortunately, just egress.

 Actually, I think most of the CPU load is due to overheads in the 
 userspace-kernel interface and the device driver, rather than the qdiscs 
 themselves.

You will see it bound by the softirq thread, but, what, exactly,
inside that, is kind of unknown. (I presently lack time to build up
profilable kernels on these low end arches. )

 Something about TBF causes more overhead - it goes through periods of lower 
 CPU use similar to the other shapers, but then spends periods at considerably 
 higher CPU load, all without changing the overall throughput.

 The flip side of this is that TBF might be producing a smoother stream of 
 packets.  The receiving computer (which is fast enough to notice such things) 
 reports a substantially larger number of recv() calls are required to take in 
 the data from TBF than from anything else - averaging about 4.4KB rather than 
 9KB or so.  But at these data rates, it probably matters little.

Well, htb has various tuning options (see quantum and burst) that
alter its behavior along the lines of what you're seeing from tbf.
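
Something along these lines, for instance (numbers purely illustrative):

  # smaller burst and quantum make htb dole traffic out in finer increments
  tc class change dev eth0 parent 1: classid 1:1 htb rate 100mbit \
     burst 2k quantum 1514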


 FWIW, apparently Apple's variant of the GEM chipset doesn't support jumbo 
 frames.  This does, however, mean that I'm definitely working with an MTU of 
 1500, similar to what would be sent over the Internet.

 These tests were all run using nttpc.  I wanted to finally try out RRUL, but 
 the wrappers fail to install via pip on my Gentoo boxes.  I'll need to 
 investigate further before I can make pretty graphs like everyone else.

  - Jonathan Morton




-- 
Dave Täht

NSFW: 
https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article


[Bloat] What is wrong with Microsoft's receive window auto-tuning?

2014-09-01 Thread Jerry Jongerius
I am noticing (via WireShark traces) at times that Microsoft's (Windows 7)
receive window auto-tuning goes horribly wrong, causing significant buffer
bloat.  And at other times, the tuning appears to work just fine.

For example, BDP suggests a receive window of 750k, and most often Windows
tunes around 1MB -- but at other times, it will tune to 3.8MB (way more than
it should).
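
For reference, BDP here is just bottleneck rate times RTT; a 750k window
corresponds to, say, a 100 Mbit/s path with a 60 ms RTT (illustrative
numbers only):

  echo $(( 100000000 / 8 * 60 / 1000 ))   # = 750000 bytes, i.e. ~750k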

Is anyone aware of any research either pointing out how their tuning
algorithm works, or of known bugs/problems with the algorithm?
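
For anyone who wants to reproduce the observation from a capture, one way to
pull the advertised window out of a trace for plotting (192.0.2.1 stands in
for the Windows machine's address - adjust to taste):

  tshark -r trace.pcap -Y "ip.src == 192.0.2.1" -T fields \
     -e frame.time_relative -e tcp.window_size_value -e tcp.window_size_scalefactor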




Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...

2014-09-01 Thread Jonathan Morton

On 1 Sep, 2014, at 8:01 pm, Dave Taht wrote:

 On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton chromati...@gmail.com 
 wrote:
 
 On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:
 
 Could I get you to also try HFSC?
 
 Once I got a kernel running that included it, and figured out how to make it 
 do what I wanted...
 
 ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.
 
 If you are feeling really inspired, try cbq. :) One thing I sort of like 
 about cbq is that it (I think)
 (unlike htb presently) operates off an estimated size for the next packet 
 (which isn't dynamic, sadly),
 where the others buffer up an extra packet until they can be delivered.

It's also hilariously opaque to configure, which is probably why nobody uses it 
- the RED problem again - and the top link when I Googled for best practice on 
it gushes enthusiastically about Linux 2.2!  The idea of manually specifying an 
average packet size in particular feels intuitively wrong to me.  Still, I 
might be able to try it later on.
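
For reference, a minimal cbq setup in the style of the old LARTC examples
looks something like this - note the hand-fed avpkt, cell, allot and weight
values (all figures illustrative):

  tc qdisc add dev eth0 root handle 1: cbq bandwidth 100mbit avpkt 1000 cell 8
  tc class add dev eth0 parent 1: classid 1:1 cbq bandwidth 100mbit rate 50mbit \
     weight 5mbit prio 8 allot 1514 cell 8 maxburst 20 avpkt 1000 bounded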

Most class-based shapers are probably more complex to set up for simple needs 
than they need to be.  I have to issue three separate 'tc' invocations for a 
minimal configuration of each of them, repeating several items of data between 
them.  They scale up reasonably well to complex situations, but such uses are 
relatively rare.
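
The HTB flavour of that three-step dance looks roughly like this (rate and
device name both repeated; values illustrative):

  tc qdisc add dev eth0 root handle 1: htb default 1
  tc class add dev eth0 parent 1: classid 1:1 htb rate 50mbit ceil 50mbit
  tc qdisc add dev eth0 parent 1:1 handle 10: fq_codel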

 In my quest for absolutely minimal latency I'd love to be rid of that
 last extra non-in-the-fq_codel-qdisc packet... either with a peek
 operation or with a running estimate.

I suspect that something like fq_codel which included its own shaper (with the 
knobs set sensibly by default) would gain more traction via ease of use - and 
might even answer your wish.

 It would be cool to be able to program the ethernet hardware itself to
 return completion interrupts at a given transmit rate (so you could
 program the hardware to be any bandwidth not just 10/100/1000). Some
 hardware so far as I know supports this with a pacing feature.

Is there a summary of hardware features like this anywhere?  It'd be nice to 
see what us GEM and RTL proles are missing out on.  :-)

 Actually, I think most of the CPU load is due to overheads in the 
 userspace-kernel interface and the device driver, rather than the qdiscs 
 themselves.
 
 You will see it bound by the softirq thread, but, what, exactly,
 inside that, is kind of unknown. (I presently lack time to build up
 profilable kernels on these low end arches. )

When I eventually got RRUL running (on one of the AMD boxes, so the PowerBook 
only has to run the server end of netperf), the bandwidth maxed out at about 
300Mbps each way, and the softirq was bouncing around 60% CPU.  I'm pretty sure 
most of that is shoving stuff across the PCI bus (even though it's internal to 
the northbridge), or at least waiting for it to go there.  I'm happy to assume 
that the rest was mostly kernel-userspace interface overhead to the netserver 
instances.

But this doesn't really answer the question of why the WNDR has so much lower a 
ceiling with shaping than without.  The G4 is powerful enough that the overhead 
of shaping simply disappears next to the overhead of shoving data around.  Even 
when I turn up the shaping knob to a value quite close to the hardware's 
unshaped capabilities (eg. 400Mbps one-way), most of the shapers stick to the 
requested limit like glue, and even the worst offender is within 10%.  I 
estimate that it's using only about 500 clocks per packet *unless* it saturates 
the PCI bus.

It's possible, however, that we're not really looking at a CPU limitation, but 
a timer problem.  The PowerBook is a proper desktop computer with hardware to 
match (modulo its age).  If all the shapers now depend on the high-resolution 
timer, how high-resolution is the WNDR's timer?
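
If anyone with a WNDR to hand wants to check, something like this should show
whether high-resolution timers are actually in use there (assuming a typical
OpenWrt-style build that exposes /proc/config.gz):

  # .resolution lines and the event_handler reveal whether hrtimers are active
  grep -E "resolution|event_handler" /proc/timer_list
  # and the tick/hrtimer options the kernel was built with
  zcat /proc/config.gz | grep -E "CONFIG_HZ=|CONFIG_HIGH_RES_TIMERS"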

 - Jonathan Morton



Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...

2014-09-01 Thread Dave Taht
On Mon, Sep 1, 2014 at 11:06 AM, Jonathan Morton chromati...@gmail.com wrote:

 On 1 Sep, 2014, at 8:01 pm, Dave Taht wrote:

 On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton chromati...@gmail.com 
 wrote:

 On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:

 Could I get you to also try HFSC?

 Once I got a kernel running that included it, and figured out how to make 
 it do what I wanted...

 ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.

 If you are feeling really inspired, try cbq. :) One thing I sort of like 
 about cbq is that it (I think)
 (unlike htb presently) operates off an estimated size for the next packet 
 (which isn't dynamic, sadly),
 where the others buffer up an extra packet until they can be delivered.

 It's also hilariously opaque to configure, which is probably why nobody uses 
 it - the RED problem again - and the top link when I Googled for best 
 practice on it gushes enthusiastically about Linux 2.2!  The idea of manually 
 specifying an average packet size in particular feels intuitively wrong to 
 me.  Still, I might be able to try it later on.

I felt an EWMA of egress packet sizes would be a better estimator, yes.

 Most class-based shapers are probably more complex to set up for simple needs 
 than they need to be.  I have to issue three separate 'tc' invocations for a 
 minimal configuration of each of them, repeating several items of data 
 between them.
 They scale up reasonably well to complex situations, but such uses are 
 relatively rare.

 In my quest for absolutely minimal latency I'd love to be rid of that
 last extra non-in-the-fq_codel-qdisc packet... either with a peek
 operation or with a running estimate.

 I suspect that something like fq_codel which included its own shaper (with 
 the knobs set sensibly by default) would gain more traction via ease of use - 
 and might even answer your wish.

I agree that a simpler to use qdisc would be good. I'd like something
that preserves multiple (3-4) service classes (as pfifo_fast and
sch_fq do) using drr, deals with diffserv, and could be invoked with a
command line like:

tc qdisc add dev eth0 cake bandwidth 50mbit diffservmap std

I had started on that (basically pouring cerowrt's simple.qos code
into C with a simple lookup table for diffserv) many moons ago, but
the contents of the yurtlab, and that code, were stolen - and I was (and
remain) completely stuck on how to do soft rate limiting more sanely,
particularly in asymmetric scenarios.

("cake" stood for Common Applications Kept Enhanced. fq_codel is not
a drop-in replacement for pfifo_fast due to its classless nature.
sch_fq comes closer, but it's more server oriented. QFQ with 4
weighted bands + fq_codel can be made to do the levels-of-service
stuff fairly straightforwardly at line rate, but the tc filter code
tends to get rather long to handle all the diffserv classes...)
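
To give a flavour of what that looks like for just a single band - repeat the
filter for every DSCP you care about, plus a default class, and it gets long
fast (values illustrative):

  tc qdisc add dev eth0 root handle 1: qfq
  tc class add dev eth0 parent 1: classid 1:1 qfq weight 10
  tc qdisc add dev eth0 parent 1:1 fq_codel
  # steer EF-marked traffic (DSCP 46, i.e. TOS byte 0xb8) into that band
  tc filter add dev eth0 parent 1: protocol ip prio 1 \
     u32 match ip tos 0xb8 0xfc flowid 1:1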

So... we keep polishing the sqm system, and I keep tracking progress
in how diffserv classification will be done in the future (in ietf
groups like rmcat and dart), and figuring out how to deal better with
aggregating macs in general is what keeps me awake nights, more than
finishing cake...

We'll get there, eventually.

 It would be cool to be able to program the ethernet hardware itself to
 return completion interrupts at a given transmit rate (so you could
 program the hardware to be any bandwidth not just 10/100/1000). Some
 hardware so far as I know supports this with a pacing feature.

 Is there a summary of hardware features like this anywhere?  It'd be nice to 
 see what us GEM and RTL proles are missing out on.  :-)

I'd like one. There are certain 3rd party firmwares like octeon's
where it seems possible to add more features to the firmware
co-processor, in particular.


 Actually, I think most of the CPU load is due to overheads in the 
 userspace-kernel interface and the device driver, rather than the qdiscs 
 themselves.

 You will see it bound by the softirq thread, but, what, exactly,
 inside that, is kind of unknown. (I presently lack time to build up
 profilable kernels on these low end arches. )

 When I eventually got RRUL running (on one of the AMD boxes, so the PowerBook 
 only has to run the server end of netperf), the bandwidth maxed out at about 
 300Mbps each way, and the softirq was bouncing around 60% CPU.  I'm pretty 
 sure most of that is shoving stuff across the PCI bus (even though it's 
 internal to the northbridge), or at least waiting for it to go there.  I'm 
 happy to assume that the rest was mostly kernel-userspace interface overhead 
 to the netserver instances.

perf and the older oprofile are our friends here.

 But this doesn't really answer the question of why the WNDR has so much lower 
 a ceiling with shaping than without.  The G4 is powerful enough that the 
 overhead of shaping simply disappears next to the overhead of shoving data 
 around.  Even when I turn up the shaping knob to a value quite close to the 
 hardware's 

Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...

2014-09-01 Thread Aaron Wood
  But this doesn't really answer the question of why the WNDR has so much
 lower a ceiling with shaping than without.  The G4 is powerful enough that
 the overhead of shaping simply disappears next to the overhead of shoving
 data around.  Even when I turn up the shaping knob to a value quite close
 to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the
 shapers stick to the requested limit like glue, and even the worst offender
 is within 10%.  I estimate that it's using only about 500 clocks per packet
 *unless* it saturates the PCI bus.
 
  It's possible, however, that we're not really looking at a CPU
 limitation, but a timer problem.  The PowerBook is a proper desktop
 computer with hardware to match (modulo its age).  If all the shapers now
 depend on the high-resolution timer, how high-resolution is the WNDR's
 timer?

 Both good questions worth further exploration.


Doing some napkin math and some spec reading, I think that the memory bus
is a likely factor.  The G4 had a fairly impressive memory bus for the day
(64-bit?).  The WNDR3800's RAM appears to be used in an x16 configuration (based
on the numbers on the memory parts).  It may have *just* enough bandwidth to
push concurrent 3x3 802.11n through the software bridge interface, which
short-circuits a lot of processing (IIRC).

The typical way I've seen a home router being benchmarked for the
marketing numbers is to flow TCP data between a wifi client and a wired
client.  A single socket is used, for a uni-directional stream of data.  So
long as they can hit peak rates (peak MCS), it will get marked as good for
"up to 900Mbps!!" or whatever they want to say.

The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB):
the various buffers for fq_codel and htb may stay in L2 on the G4, but
there simply isn't room in the AR7161 for that, which puts further pressure
on the bus.

-Aaron


Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...

2014-09-01 Thread Jonathan Morton

On 1 Sep, 2014, at 11:25 pm, Aaron Wood wrote:

 But this doesn't really answer the question of why the WNDR has so much 
 lower a ceiling with shaping than without.  The G4 is powerful enough that 
 the overhead of shaping simply disappears next to the overhead of shoving 
 data around.  Even when I turn up the shaping knob to a value quite close 
 to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the 
 shapers stick to the requested limit like glue, and even the worst offender 
 is within 10%.  I estimate that it's using only about 500 clocks per packet 
 *unless* it saturates the PCI bus.
 
 It's possible, however, that we're not really looking at a CPU limitation, 
 but a timer problem.  The PowerBook is a proper desktop computer with 
 hardware to match (modulo its age).  If all the shapers now depend on the 
 high-resolution timer, how high-resolution is the WNDR's timer?

 Both good questions worth further exploration.

 Doing some napkin math and some spec reading, I think that the memory bus is 
 a likely factor.  The G4 had a fairly impressive memory bus for the day 
 (64-bit?).  The WNDR3800 appears to be used in an x16 configuration (based on 
 the numbers on the memory parts).  It may have *just* enough bw to push 
 concurrent 3x3 802.11n through the software bridge interface, which 
 short-circuits a lot of processing (IIRC).   
 
 The typical way I've seen a home router being benchmarked for the marketing 
 numbers is to flow tcp data to/from a wifi client to a wired client.  Single 
 socket is used, for a uni-directional stream of data.  So long as they can 
 hit peak rates (peak MCS), it will get marked as good for up to 900Mbps!! 
 or whatever they want to say.
 
 The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB) the 
 various buffers for fq_codel and htb may stay in L2 on the G4, but there 
 simply isn't room in the AR7161 for that, which puts further pressure on the 
 bus.

I don't think that's it.

First a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't 
have the external L3 cache interface, so it only has the 256KB or 512KB 
internal L2 cache (I forget which).  The desktop version (7457A) used external 
cache.  The G4 was considered to be *crippled* by its FSB by the end of its 
run, since it never adopted high-performance signalling techniques, nor moved 
the memory controller on-die; it was quoted that the G5 (970) could move data 
using *single-byte* operations faster than the *peak* throughput of the G4's 
FSB.  The only reason the G5 never made it into a PowerBook was because it 
wasn't battery-friendly in the slightest.

But that makes little difference to your argument - compared to a cheap 
CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even 
if it is already a decade old.

More compelling is that even at 16-bit width, the WNDR's RAM should have more 
bandwidth than my PowerBook's PCI bus.  Standard PCI is 33MHz x 32-bit, and I 
can push a steady 30MB/sec in both directions simultaneously, which corresponds 
in total to about half the PCI bus's theoretical capacity.  (The GEM reports 
66MHz capability, but it shares the bus with an IDE controller which doesn't, 
so I assume it is stuck at 33MHz.)  A 16-bit RAM should be able to match PCI if 
it runs at 66MHz, which is the lower limit of JEDEC standards for SDRAM.
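
Rough peak numbers, just to sanity-check that:

  echo $(( 4 * 33000000 ))   # 32-bit PCI at 33 MHz  = 132000000 bytes/sec
  echo $(( 2 * 66000000 ))   # 16-bit SDRAM at 66 MHz = 132000000 bytes/sec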

The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies 
at least 200MHz unless the integrator was colossally stingy.  Further, a little 
digging suggests that the memory bus should be 32-bit wide (hence two 16-bit 
RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed.  For 
an embedded SoC, that's really not too bad - it should be able to sustain 
1GB/sec, in one direction at a time.

So that takes care of the argument for simply moving the payload around.  In 
any case, the WNDR demonstrably *can* cope with the available bandwidth if the 
shaping is turned off.

For the purposes of shaping, the CPU shouldn't need to touch the majority of 
the payload - only the headers, which are relatively small.  The bulk of the 
payload should DMA from one NIC to RAM, then DMA back out of RAM to the other 
NIC.  It has to do that anyway to route them, and without shaping there'd be 
more of them to handle.  The difference might be in the data structures used by 
the shaper itself, but I think those are also reasonably compact.  It doesn't 
even have to touch userspace, since it's not acting as the endpoint as my 
PowerBook was during my tests.

And while the MIPS 24K core is old, it's also been die-shrunk over the 
intervening years, so it runs a lot faster than it originally did.  I very much 
doubt that it's as refined as my G4, but it could probably hold its own 
relative to a comparable ARM SoC such as the Raspberry Pi.  (Unfortunately, the 
latter doesn't have the I/O capacity to do high-speed networking - USB only.)  
Atheros publicity