On a somewhat related note - I've just received my NZ/AU-region Almond+, which is an ARM9 dual-core router based on the Cortina CSC SoC:
https://www.cortina-systems.com/product/digital-home-processors/16-products/996-cs7542-cs7522

More details:

On 2 September 2014 21:27, Jonathan Morton <chromati...@gmail.com> wrote:
>
> On 2 Sep, 2014, at 1:14 am, Aaron Wood wrote:
>
>>> For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests.
>>
>> In an ideal case, yes. But is that how this gets managed? (I have no idea, I'm certainly not a kernel developer).
>
> It would be monumentally stupid to integrate two GigE MACs onto an SoC, and then to call it a "network processor", without adequate DMA support. I don't think Atheros are that stupid.
>
> Here's a more detailed datasheet:
>
> http://pdf.datasheetarchive.com/indexerfiles/Datasheets-SW6/DSASW00118777.pdf
>
> "Another memory factor is the ability to support multiple I/O operations in parallel via the WNPU's various ports. The on-chip SRAM in AR7100 WNPUs has 5 ports that enable simultaneous access to and from five sources: the two gigabit Ethernet ports, the PCI port, the USB 2.0 port and the MIPS processor."
>
> It's a reasonable question, however, whether the driver uses that support properly. Mainline Linux kernel code seems to support the SoC but not the Ethernet; if it were just a minor variant of some other Atheros hardware, I'd have expected to see it integrated into one of the existing drivers. Or maybe it is, and my greps just aren't showing it.
>
> At minimum, however, there are MMIO ranges reported for each MAC during OpenWRT's boot sequence. That's where the ring buffers are. The most the CPU has to do is read each packet from RAM and write it into those buffers, or vice versa for receive - I think that's what my PowerBook has to do. Ideally, a bog-standard DMA engine would take over that simple duty. Either way, that's something that has to happen whether it's shaped or not, so it's unlikely to be our problem.
>
> The same goes for the wireless MACs, incidentally. These are standard ath9k mini-PCI cards, and the drivers *are* in mainline. There shouldn't be any surprises with them.
>
>> If the packet data is getting moved about from buffer to buffer (for instance to do the htb calculations?) could that substantially change the processing load?
>
> The qdiscs only deal with packet and socket headers, not the full packet data. Even then, they largely pass pointers around, inserting the headers into linked lists rather than copying them into arrays. I believe a lot of attention has been directed at cache-friendliness in this area, and the MIPS caches are of conventional type.
>
>>> Which brings me back to the timers, and other items of black magic.
>>
>> Which would point to under-utilizing the processor core, while still having high load? (I'm not seeing that, I'm curious if that would be the case).
>
> It probably wouldn't manifest as high system load. Rather, poor timer resolution or latency would show up as excessive delays between packets, during which the CPU is idle. The packet egress times may turn out to be quantised - that would be a smoking gun, if detectable.
>
>>> Incidentally, transfer speed benchmarks involving wireless will certainly be limited by the wireless link. I assume that's not a factor here.
>>
>> That's the usual suspicion. But these are RF-chamber, short-range lab setups where the radios are running at full speed in perfect environments...
>
> Sure. But even turbocharged 'n' gear tops out at 450Mbps signalling, and much less than that is available even theoretically for TCP/IP throughput. My point is that you're probably not running *your* tests over wireless.
>
>> What this makes me realize is that I should go instrument the cpu stats with each of the various operating modes:
>>
>> * no shaping, anywhere
>> * egress shaping
>> * egress and ingress shaping at various limited levels:
>>   * 10Mbps
>>   * 20Mbps
>>   * 50Mbps
>>   * 100Mbps
>
> Smaller increments at the high end of the range may prove to be useful. I would expect the CPU usage to climb nonlinearly (busy-waiting) if there's a bottleneck in a peripheral device, such as the PCI bus. The way the kernel classifies that usage may also be revealing.
>
>> Heck, what about running HTB simply from a 1ms timer instead of from a data driven timer?
>
> That might be what's already happening. We have to figure that out before we can work out a solution.
>
> - Jonathan Morton
>
> _______________________________________________
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
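For anyone wanting to reproduce the shaping matrix Aaron describes above, a rough sketch of the egress sweep scripted with tc follows. The interface name (ge00), the single-class HTB tree, and the rate list are placeholders for the router under test, not anything from Aaron's actual setup; it defaults to a dry run that only prints the commands, so nothing is applied until you set DRY_RUN=0 as root.

```shell
#!/bin/sh
# Sweep an egress HTB shaper through the rates in the test matrix.
# IFACE and RATES are assumptions -- set them for the device under test.
# DRY_RUN=1 (the default) only prints the tc commands instead of running them.
IFACE=${IFACE:-ge00}
RATES="10mbit 20mbit 50mbit 100mbit"
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode; execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

shape_sweep() {
    for RATE in $RATES; do
        # Replace any existing root qdisc with an HTB tree limited to $RATE.
        run tc qdisc del dev "$IFACE" root
        run tc qdisc add dev "$IFACE" root handle 1: htb default 10
        run tc class add dev "$IFACE" parent 1: classid 1:10 htb rate "$RATE"
        # ...run the throughput test and collect CPU stats at this rate...
    done
    # Finish with shaping removed (the "no shaping, anywhere" case).
    run tc qdisc del dev "$IFACE" root
}

shape_sweep
```

Reviewing the dry-run output first, then re-running with DRY_RUN=0, keeps a fat-fingered rate from cutting off the link you're logged in over.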
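On the instrumentation side, Jonathan's point about how the kernel classifies the usage is exactly what /proc/stat exposes: qdisc and shaper work largely lands in softirq time, while busy-waiting on a peripheral would show up elsewhere. A minimal sampler along these lines (the one-second interval is arbitrary, and the field positions follow proc(5)) could be run alongside each test:

```shell
#!/bin/sh
# Sample the aggregate "cpu" line of /proc/stat twice, INTERVAL seconds
# apart, and report how the elapsed time splits between user, system,
# irq, softirq and idle.
INTERVAL=${1:-1}

# awk program kept in a variable so it can also be run on canned samples.
SPLIT='
NR==1 { for (i = 2; i <= NF; i++) a[i] = $i }
NR==2 {
    total = 0
    for (i = 2; i <= NF; i++) { d[i] = $i - a[i]; total += d[i] }
    if (total == 0) total = 1
    # /proc/stat fields: user nice system idle iowait irq softirq ...
    printf "user %.1f%% system %.1f%% irq %.1f%% softirq %.1f%% idle %.1f%%\n",
        100*d[2]/total, 100*d[4]/total, 100*d[7]/total, 100*d[8]/total, 100*d[5]/total
}'

# Fall back to a zero line on systems without /proc/stat, so the script
# degrades gracefully instead of failing.
snap() { head -n1 /proc/stat 2>/dev/null || echo "cpu 0 0 0 0 0 0 0 0"; }

S1=$(snap)
sleep "$INTERVAL"
S2=$(snap)
printf '%s\n%s\n' "$S1" "$S2" | awk "$SPLIT"
```

If shaping overhead is real work, the softirq share should climb with the shaped rate; if the idle share stays high while throughput stalls, that points back at the timer-latency theory.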