Testing network / NAT performance
Over the years I've seen multiple reports claiming that a new OpenWrt release / kernel update / netifd change / DSA introduction caused a regression in router network / NAT speed (masquerade NAT in most cases). Most of those reports remained unresolved, I believe. The problem is that:
1. OpenWrt doesn't have automated testing environments
2. Developers can't conclude anything from undetailed reports
3. Even experienced users don't know how to do proper debugging

I've spent most of the last 2 months researching & testing masquerade NAT performance. I thought I'd share my findings & results. Hopefully this will get more people involved in tracing & fixing such regressions.

* Testing method *

In 99% of cases it's a bad idea to use online speed test services: they may be too unreliable. It's better to set up a local server instead. For the actual testing you may use iperf or iperf3. If needed - for some reason - FTP, HTTP or another protocol may be an option too.

* Testing results *

Network traffic is often not perfectly stable. To avoid getting false results it may be worth it to:
1. Repeat the test in a few sessions
2. Reject the lowest & highest results
3. Calculate an average speed

Example of my testing:

for i in $(seq 1 5); do
	date
	iperf -t 80 -i 10 -c 192.168.99.1 | head -n -1 | sed -n 's/.* \([0-9][0-9]*\) Mbits\/sec.*/\1/p' | sort -n
	echo
	sleep 15
done

The above script lists 8 results from each iperf session. Afterwards I take the middle 4 and calculate an average from them. Then I calculate an average across all 5 sessions. It may be overkill, but it was meant to deal with some really unstable cases.

* Environment setup *

Get some (usually 2) PCs powerful enough to easily handle the maximum expected router traffic. Once set up, avoid changing anything: a kernel update or configuration change on a PC may affect results even if the router is the bottleneck [1]. Disable power saving - I once noticed lower performance whenever the screen saver got activated. Connect a PC to the WAN port and set it up to use a static IP.
You may set up a DHCP server too, or just make OpenWrt use a static WAN IP & gateway. Start an iperf / FTP / HTTP / whatever server. Connect another PC to a LAN port and install a matching client for generating network traffic.

* OpenWrt customizations *

Depending on the setup you may need some custom configuration changes. To avoid applying them manually on every boot, use uci-defaults scripts. Example of my WAN setup:

mkdir -p files/etc/uci-defaults/
cat << EOF > files/etc/uci-defaults/90-nat.sh
#!/bin/sh

uci set network.wan.proto='static'
uci set network.wan.ipaddr='192.168.99.2'
uci set network.wan.netmask='255.255.255.0'
EOF

* Finding regressions *

For continuous testing pick an interval (testing every day or every n-th commit) and look for regressions. If you notice a regression, the first step is to find the first bad commit. End users often assume that a regression was caused by a kernel change, as that is the simplest difference to notice. Always find the exact commit. Make sure to use git bisect [2] for finding first bad commits.

* Stabilizing performance *

Probably the most annoying problem in debugging is unstable results. Speed changing between testing sessions / reboots / recompilations makes the whole testing unreliable and makes it hard to find a real regression. Below are a few tips that may help stabilize network speeds.

1. Repeat tests and take an average

Explained above.

2. Don't change the environment setup

Explained above.

3. Use the pfifo qdisc

It should be more stable for simple traffic (e.g. iperf generated). Include the "tc" package and execute something like:
tc qdisc replace dev eth0 root pfifo
Verify with:
tc qdisc

4. Adjust rps_cpus and xps_cpus

On multi-CPU devices, having multiple CPUs assigned to a single network device may result in traffic being assigned to a random CPU and in varying speeds across testing sessions.

5. Disable CONFIG_SMP

This will likely reduce performance, but may help finding a regression if testing results vary a lot.

6. Organize kernel symbols

CPUs of home routers usually have small caches. The way kernel symbols get organized during compilation may significantly affect network performance [3]. It's especially annoying as network-unrelated changes may move / reorder symbols and affect cache hits & misses. There isn't a reliable solution for that. It may help to add:
-falign-functions=32
or
-falign-functions=64
(depending on the platform), using e.g. KBUILD_CFLAGS.

* Profiling *

Profiling with "perf" [4] allows checking what consumes CPU time. It's very useful for finding code…
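The "take the middle results, then average" procedure described in the testing-results section above can be sketched as a small shell helper. The sample speeds below are made-up numbers, not real measurements:

```shell
#!/bin/sh
# Trimmed mean of per-interval iperf results: sort numerically, drop the
# 2 lowest and 2 highest values, average the rest (Mbits/sec, one per line).
trimmed_mean() {
	sort -n | awk '
		{ v[NR] = $1 }
		END {
			for (i = 3; i <= NR - 2; i++)	# skip 2 lowest & 2 highest
				sum += v[i]
			printf "%.1f\n", sum / (NR - 4)
		}'
}

# Example with 8 made-up per-interval speeds from one session:
printf '310\n305\n298\n312\n307\n301\n295\n309\n' | trimmed_mean   # prints 305.5
```

Feeding it the output of the sed pipeline from the iperf loop above gives one number per session; averaging those across sessions is the final result.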
Re: Testing network / NAT performance
During the last years NAT performance on Northstar (bcm53xx) changed multiple times. No one keeps a close eye on this, and Northstar testing results also seem very unstable. During the last 2 months I probably tested over a hundred OpenWrt commits going back to 2015.

I decided to do the testing with -falign-functions=32 and at some point I disabled CONFIG_SMP. I also did some tests without the rtcache patch, which was dropped later anyway. Below I'm sharing my notes.

1. afafbc0d7454 ("kernel: bgmac: add more DMA related fixes")

This commit introduced varying speeds across testing sessions. It seems that could be caused by the removal of dma_sync_single_for_cpu(), which could make rps_cpus actually work as expected.

2. 39f115707531 ("bcm53xx: switch to kernel 4.4")

Kernel 4.2 introduced commit 66e5133f19e9 ("vlan: Add GRO support for non hardware accelerated vlan") which lowered Northstar / bgmac performance as it introduced csum_partial() calls in new code paths [1]. The regression can be worked around with:
ethtool -K eth0 gro off
(note: DSA requires disabling GRO also for switch ports)

3. 916e33fa1e14 ("netifd: update to the latest version, rewrite RPS/XPS handling")

This changed the rps_cpus and xps_cpus default values. It affected networking depending on the number of device CPUs and the setup.

4. 50c6938b95a0 ("bcm53xx: add v5.4 support")

This commit actually switched bcm53xx from kernel 4.14 to 4.19, which somehow dropped network speed by 5%. It could be an actual net subsystem change or just something unrelated. Too small a difference to make full debugging worth it.

5. 7125323b81d7 ("bcm53xx: switch to kernel 5.4")

Improved network speed by 25% (256 Mb/s → 320 Mb/s). I didn't have time to bisect this *improvement* to a single kernel commit. I tried profiling but it isn't obvious to me what caused that improvement.
Kernel 4.19:
    11.94%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
     7.06%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_inv_range
     3.37%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_clean_range
     2.80%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_clean_range
     2.67%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_poll
     2.63%  ksoftirqd/0  [kernel.kallsyms]  [k] __dev_queue_xmit
     2.43%  ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
     2.13%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_start_xmit
     1.82%  ksoftirqd/0  [kernel.kallsyms]  [k] nf_hook_slow
     1.54%  ksoftirqd/0  [kernel.kallsyms]  [k] ip_forward
     1.50%  ksoftirqd/0  [kernel.kallsyms]  [k] dma_cache_maint_page

Kernel 5.4:
    14.53%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
     8.02%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_inv_range
     3.32%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_poll
     3.28%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_clean_range
     3.12%  ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
     2.70%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_clean_range
     2.46%  ksoftirqd/0  [kernel.kallsyms]  [k] __dev_queue_xmit
     2.26%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_start_xmit
     1.73%  ksoftirqd/0  [kernel.kallsyms]  [k] __dma_page_dev_to_cpu
     1.72%  ksoftirqd/0  [kernel.kallsyms]  [k] nf_hook_slow

6. ba72ed537c4a ("kernel: backport GRO improvements")

Improved network speed by 10%.

7. 17576b1b2aea ("kernel: drop the conntrack rtcache patch")

Dropped network speed by 15%.

8. f55f1dbaad33 ("bcm53xx: switch to the kernel 5.10")

Kernel bump that introduced upstream commit 8c7da63978f1 ("bgmac: configure MTU and add support for frames beyond 8192 byte size") which dropped speed by 49%.

9. e9672b1a8fa4 ("bcm53xx: switch to the upstream DSA-based b53 driver")

At first it seemed like a 5% decrease in network performance. Profiling revealed it was caused by an added csum_partial() call. Further debugging showed it was tcp4_gro_receive() that started calling it. Long story short: with DSA, GRO needs disabling on all switch interfaces.
After some further testing it seems DSA actually bumped network speed from 404 Mb/s to 445 Mb/s. From profiling it again isn't clear why.

swconfig:
    13.46%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
     7.39%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_inv_range
     3.27%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_clean_range
     2.74%  ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core.constprop.0
     2.72%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_clean_range
     2.71%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_poll
     2.56%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_start_xmit
     2.31%  ksoftirqd/0  [kernel.kallsyms]  [k] fib_table_lookup
     1.91%  ksoftirqd/0  [kernel.kallsyms]  [k] ip_route_inp…
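The GRO workaround mentioned in points 2 and 9 (needed on all switch ports with DSA) can be applied with a short loop. The interface names below are examples; adjust them to the device under test:

```shell
#!/bin/sh
# Disable GRO on the WAN NIC and on every DSA switch port. Interface names
# are examples - list the real ones with: ls /sys/class/net/
for dev in eth0 lan1 lan2 lan3 lan4; do
	ethtool -K "$dev" gro off 2>/dev/null || echo "skipped $dev (no such device?)"
done

# Verify with e.g.:
# ethtool -k lan1 | grep generic-receive-offload
```

Remember this resets on reboot; for testing it can go into a uci-defaults or rc.local style hook.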
Re: Testing network / NAT performance
On 12.06.2022 21:58, Rafał Miłecki wrote:
> 5. 7125323b81d7 ("bcm53xx: switch to kernel 5.4")
>
> Improved network speed by 25% (256 Mb/s → 320 Mb/s).
>
> I didn't have time to bisect this *improvement* to a single kernel
> commit. I tried profiling but it isn't obvious to me what caused that
> improvement.
>
> Kernel 4.19:
>     11.94%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
>      7.06%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_inv_range
>      3.37%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_clean_range
>      2.80%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_clean_range
>      2.67%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_poll
>      2.63%  ksoftirqd/0  [kernel.kallsyms]  [k] __dev_queue_xmit
>      2.43%  ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>      2.13%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_start_xmit
>      1.82%  ksoftirqd/0  [kernel.kallsyms]  [k] nf_hook_slow
>      1.54%  ksoftirqd/0  [kernel.kallsyms]  [k] ip_forward
>      1.50%  ksoftirqd/0  [kernel.kallsyms]  [k] dma_cache_maint_page
>
> Kernel 5.4:
>     14.53%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
>      8.02%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_inv_range
>      3.32%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_poll
>      3.28%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_clean_range
>      3.12%  ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>      2.70%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_clean_range
>      2.46%  ksoftirqd/0  [kernel.kallsyms]  [k] __dev_queue_xmit
>      2.26%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_start_xmit
>      1.73%  ksoftirqd/0  [kernel.kallsyms]  [k] __dma_page_dev_to_cpu
>      1.72%  ksoftirqd/0  [kernel.kallsyms]  [k] nf_hook_slow

Riddle solved. Change to bless/blame: 4e0c54bc5bc8 ("kernel: add support for kernel 5.4").
First of all, bcm53xx uses:
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y

OpenWrt's kernel Makefile for kernel 4.19:

ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS += -Os $(EXTRA_OPTIMIZATION)
else
KBUILD_CFLAGS += -O2 -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
endif

OpenWrt's kernel Makefile for 5.4:

ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
KBUILD_CFLAGS += -O2 $(EXTRA_OPTIMIZATION)
else ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3
KBUILD_CFLAGS += -O3 $(EXTRA_OPTIMIZATION)
else ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS += -Os -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
endif

As you can see, 4e0c54bc5bc8 accidentally moved -fno-reorder-blocks from !CONFIG_CC_OPTIMIZE_FOR_SIZE to CONFIG_CC_OPTIMIZE_FOR_SIZE.

I noticed a problem with -fno-reorder-blocks a long time ago, see:
[PATCH RFC] kernel: drop -fno-reorder-blocks
https://patchwork.ozlabs.org/project/openwrt/patch/20190409093046.13401-1-zaj...@gmail.com/

It should really get sorted out...

_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
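For comparison, a 5.4 conditional that preserved the 4.19 placement (keeping -fno-reorder-blocks and -fno-tree-ch on the -O2 branch instead of the -Os branch) would read as follows. This is just a sketch of what the unmodified behaviour would have been, not a committed fix:

```make
ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
KBUILD_CFLAGS += -O2 -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
else ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3
KBUILD_CFLAGS += -O3 $(EXTRA_OPTIMIZATION)
else ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS += -Os $(EXTRA_OPTIMIZATION)
endif
```

Whether keeping the flags at all is desirable is exactly what the referenced RFC patch questions.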
Fwd: Testing network / NAT performance
[Ugh, now with less HTML, sorry about that…]

Hi, Rafał,

On Tue, 14 Jun 2022 at 14:20, Rafał Miłecki wrote:
>
> As you can see 4e0c54bc5bc8 has accidentally moved -fno-reorder-blocks
> from !CONFIG_CC_OPTIMIZE_FOR_SIZE to CONFIG_CC_OPTIMIZE_FOR_SIZE.
>
> I've noticed problem with -fno-reorder-blocks long time ago, see:
> [PATCH RFC] kernel: drop -fno-reorder-blocks
> https://patchwork.ozlabs.org/project/openwrt/patch/20190409093046.13401-1-zaj...@gmail.com/
>
> It should really get sorted out...

Why not just drop both -fno-reorder-blocks and -fno-tree-ch? I have no idea about the details, but those options seem to have been carried forward from a time when GCC probably had issues with them (code bloat, maybe).

I've been carrying a patch in my tree for about three years, dropping both options, with no issues at all on all the architectures (ARM1176JZF-S, 24Kc, 74Kc, 1004Kc, Cortex-A9, Cortex-A53, x86-64) and GCC versions (8, 9, 10, 11, 12) I've tested.

Cheers,
Rui
Re: Testing network / NAT performance
Hi Rafal,

Thank you for your detailed analysis and also for the detailed report. This will be very helpful whenever I run into this problem.

Can we somehow automate it, so that we get notified about a performance regression a day after a bad change was committed, and not one year after?

On 6/14/22 15:16, Rafał Miłecki wrote:
> On 12.06.2022 21:58, Rafał Miłecki wrote:
>> 5. 7125323b81d7 ("bcm53xx: switch to kernel 5.4")
>>
>> Improved network speed by 25% (256 Mb/s → 320 Mb/s).
>>
>> I didn't have time to bisect this *improvement* to a single kernel
>> commit. I tried profiling but it isn't obvious to me what caused that
>> improvement.
>>
>> Kernel 4.19:
>>     11.94%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
>>      7.06%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_inv_range
>>      3.37%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_clean_range
>>      2.80%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_clean_range
>>      2.67%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_poll
>>      2.63%  ksoftirqd/0  [kernel.kallsyms]  [k] __dev_queue_xmit
>>      2.43%  ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>>      2.13%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_start_xmit
>>      1.82%  ksoftirqd/0  [kernel.kallsyms]  [k] nf_hook_slow
>>      1.54%  ksoftirqd/0  [kernel.kallsyms]  [k] ip_forward
>>      1.50%  ksoftirqd/0  [kernel.kallsyms]  [k] dma_cache_maint_page
>>
>> Kernel 5.4:
>>     14.53%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
>>      8.02%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_inv_range
>>      3.32%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_poll
>>      3.28%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_clean_range
>>      3.12%  ksoftirqd/0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>>      2.70%  ksoftirqd/0  [kernel.kallsyms]  [k] l2c210_clean_range
>>      2.46%  ksoftirqd/0  [kernel.kallsyms]  [k] __dev_queue_xmit
>>      2.26%  ksoftirqd/0  [kernel.kallsyms]  [k] bgmac_start_xmit
>>      1.73%  ksoftirqd/0  [kernel.kallsyms]  [k] __dma_page_dev_to_cpu
>>      1.72%  ksoftirqd/0  [kernel.kallsyms]  [k] nf_hook_slow
>
> Riddle solved. Change to bless/blame: 4e0c54bc5bc8 ("kernel: add support
> for kernel 5.4").
> First of all bcm53xx uses
> CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
>
> OpenWrt's kernel Makefile in kernel 4.19:
>
> ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
> KBUILD_CFLAGS += -Os $(EXTRA_OPTIMIZATION)
> else
> KBUILD_CFLAGS += -O2 -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
> endif
>
> OpenWrt's kernel Makefile in 5.4:
>
> ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
> KBUILD_CFLAGS += -O2 $(EXTRA_OPTIMIZATION)
> else ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3
> KBUILD_CFLAGS += -O3 $(EXTRA_OPTIMIZATION)
> else ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
> KBUILD_CFLAGS += -Os -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
> endif
>
> As you can see 4e0c54bc5bc8 has accidentally moved -fno-reorder-blocks
> from !CONFIG_CC_OPTIMIZE_FOR_SIZE to CONFIG_CC_OPTIMIZE_FOR_SIZE.

This looks like an accident to me. All targets except mediatek/mt7629 set CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE in master. In OpenWrt 21.02 the ARCHS38 target set CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3, but now it is also set to normal performance.

We should probably switch mediatek/mt7629 to CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE. Does anyone have such a device and could test a patch?

> I've noticed problem with -fno-reorder-blocks long time ago, see:
> [PATCH RFC] kernel: drop -fno-reorder-blocks
> https://patchwork.ozlabs.org/project/openwrt/patch/20190409093046.13401-1-zaj...@gmail.com/
>
> It should really get sorted out...

I would suggest removing the -fno-reorder-blocks and -fno-tree-ch options, as they are not used.

The next step could be profile-guided optimization:
https://lwn.net/Articles/830300/
If the toolchain works properly I expect big improvements there, as routing, forwarding and NAT are completely in the kernel and we use devices with small caches. Profile-guided optimization should be able to avoid many cache misses by better packing the binary.

Hauke
Re: Testing network / NAT performance
Il giorno ven 17 giu 2022 alle ore 13:51 Hauke Mehrtens ha scritto:
>
> Hi Rafal,
>
> Thank you for your detailed analyses and also for the detailed report.
> This is very helpful when I ran into this problem.
>
> Can we somehow automate it so that we get notified a day after a bad
> change was committed about performance regression and not one year after?
>
> On 6/14/22 15:16, Rafał Miłecki wrote:
> > On 12.06.2022 21:58, Rafał Miłecki wrote:
> >> 5. 7125323b81d7 ("bcm53xx: switch to kernel 5.4")
> >>
> >> Improved network speed by 25% (256 Mb/s → 320 Mb/s).
> >>
> >> I didn't have time to bisect this *improvement* to a single kernel
> >> commit. I tried profiling but it isn't obvious to me what caused that
> >> improvement.
> >>
> >> Kernel 4.19:
> >> [... perf output snipped, quoted in full earlier in the thread ...]
> >>
> >> Kernel 5.4:
> >> [... perf output snipped, quoted in full earlier in the thread ...]
> >
> > Riddle solved. Change to bless/blame: 4e0c54bc5bc8 ("kernel: add support
> > for kernel 5.4").
> >
> > First of all bcm53xx uses
> > CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
> >
> > OpenWrt's kernel Makefile in kernel 4.19:
> >
> > ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
> > KBUILD_CFLAGS += -Os $(EXTRA_OPTIMIZATION)
> > else
> > KBUILD_CFLAGS += -O2 -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
> > endif
> >
> > OpenWrt's kernel Makefile in 5.4:
> >
> > ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE
> > KBUILD_CFLAGS += -O2 $(EXTRA_OPTIMIZATION)
> > else ifdef CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3
> > KBUILD_CFLAGS += -O3 $(EXTRA_OPTIMIZATION)
> > else ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
> > KBUILD_CFLAGS += -Os -fno-reorder-blocks -fno-tree-ch $(EXTRA_OPTIMIZATION)
> > endif
> >
> > As you can see 4e0c54bc5bc8 has accidentally moved -fno-reorder-blocks
> > from !CONFIG_CC_OPTIMIZE_FOR_SIZE to CONFIG_CC_OPTIMIZE_FOR_SIZE.
>
> This looks like an accident to me.
> All targets except mediatek/mt7629 are setting
> CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE in master. In Openwrt 21.02 the
> ARCHS38 target set CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3, but now it is
> also to normal performance.
>
> We should probably switch mediatek/mt7629 to
> CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE, does anyone have such a device and
> could test a patch?
>
> I've noticed problem with -fno-reorder-blocks long time ago, see:
> [PATCH RFC] kernel: drop -fno-reorder-blocks
> https://patchwork.ozlabs.org/project/openwrt/patch/20190409093046.13401-1-zaj...@gmail.com/
>
> It should really get sorted out...
>
> I would suggest to remove the -fno-reorder-blocks -fno-tree-ch options
> as they are not used.
>
> The next step could be Profile-guided optimization:
> https://lwn.net/Articles/830300/
> If the toolchain works properly I expect there big improvements as
> routing, forwarding and NAT is completely in the kernel and we use
> devices with small caches. Profile-guided optimization should be able to
> avoid many cache misses by better packaging the binary.

PGO would be a dream to accomplish, but it's a nightmare to actually use. The kernel size grows a lot and it needs to be done correctly... Also, AFAIK it's not that easy to add support for it, and it's problematic for some devices to generate the profile data.

> Hauke
Re: Testing network / NAT performance
On 12.06.2022 21:58, Rafał Miłecki wrote:
> 6. Organizing kernel symbols
>
> CPUs of home routers usually have small caches. The way kernel symbols
> get organized during compilation may significantly affect network
> performance [3]. It's especially annoying as network unrelated changes
> may move / reorder symbols and affect cache hits & misses. There isn't
> a reliable solution for that. It may help to add:
> -falign-functions=32
> or
> -falign-functions=64
> (depending on platform). using e.g. KBUILD_CFLAGS.

I'll provide an example of a really annoying behaviour I've just debugged.

I noticed a NAT speed regression when switching from kernel 5.10 to 5.15. I narrowed it down to the 5.14 → 5.15 switch and then started the bisecting process. I found that the following commit:
4c00e1e2e58ee Merge tag 'linux-watchdog-5.15-rc1' of git://www.linux-watchdog.org/linux-watchdog
dropped NAT speed from ~938 Mb/s down to 907 Mb/s.

* Here comes the interesting part: the regression isn't present in commit 41e73feb10249 ("dt-bindings: watchdog: Add compatible for Mediatek MT7986") - the last commit in the merged branch (tag). It means that the merged code affects NAT performance only on top of the previous commit 192ad3c27a489 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm").

I kept debugging and discovered that reverting dbe80cf471f94 ("watchdog: Start watchdog in watchdog_set_last_hw_keepalive only if appropriate") brings back the high NAT speed.

* Another interesting part: cherry-picking the above commit on top of 192ad3c27a489 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm") does nothing (no NAT regression). Further debugging revealed another commit required to trigger the regression: 60bcd91aafd22 ("watchdog: introduce watchdog_dev_suspend/resume"). Cherry-picking both on top of kvm affects NAT performance.

* Finally (even more fun):
1. Cherry-picking both commits on top of v5.14 does nothing (does not break NAT performance).
2. Reverting both commits from v5.15 doesn't fix the regression.

So that whole watchdog thing is just some kind of a glitch. It makes debugging an actual regression a really painful process. It breaks the reliability of automated testing. All of that happens with -falign-functions=32, and I'm not aware of any workaround for such issues.

FWIW: the actual regression seems to be caused by one of the commits introduced by 626bf91a292e2 ("Merge tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net").
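Bisect sessions like the one above can be automated with `git bisect run`, which treats exit code 0 as "good" and 1-124 as "bad". A tiny classifier can map a measured speed against a threshold; the threshold value and the build_flash_and_measure step below are hypothetical placeholders for your own tooling:

```shell
#!/bin/sh
# Exit 0 (good) when the measured speed is at or above the threshold,
# 1 (bad) otherwise - matching git bisect run's exit-code convention.
classify_speed() {
	awk -v s="$1" -v t="$2" 'BEGIN { exit !(s >= t) }'
}

# Intended use (build_flash_and_measure is a placeholder for your own steps):
#   git bisect start <bad-commit> <good-commit>
#   git bisect run sh -c 'classify_speed "$(build_flash_and_measure)" 900'

classify_speed 938 900 && echo good    # prints: good
classify_speed 850 900 || echo bad     # prints: bad
```

As the watchdog story shows, such automation is only as reliable as the underlying measurements, so the stabilization tips from the first post still apply.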