Weird traces 4.20.0-rc3+ / RIP: 0010:fib6_walk_continue+0x37/0xe6
Traces attached below:
[310658.536190] rcu: INFO: rcu_sched self-detected stall on CPU
[310658.536195] rcu: 15-: (322 ticks this GP) idle=fca/1/0x4002 softirq=50617185/50617185 fqs=64
[310658.536195] rcu: (t=15049 jiffies g=84272013 q=4728)
[310658.536200] NMI backtrace for cpu 15
[310658.536203] CPU: 15 PID: 87 Comm: ksoftirqd/15 Tainted: G W 4.20.0-rc3+ #1
[310658.536204] Call Trace:
[310658.536208]
[310658.536214] dump_stack+0x46/0x5c
[310658.536218] nmi_cpu_backtrace+0x72/0x81
[310658.536222] ? irq_force_complete_move+0x65/0x65
[310658.536224] nmi_trigger_cpumask_backtrace+0x4c/0xbf
[310658.536228] rcu_dump_cpu_stacks+0x80/0xaa
[310658.536231] rcu_check_callbacks+0x213/0x500
[310658.536234] ? tick_init_highres+0xe/0xe
[310658.536237] update_process_times+0x23/0x47
[310658.536239] tick_sched_timer+0x102/0x13a
[310658.536242] __hrtimer_run_queues+0x105/0x205
[310658.536244] ? ktime_get_update_offsets_now+0x31/0x8f
[310658.536247] hrtimer_interrupt+0x85/0x177
[310658.536251] smp_apic_timer_interrupt+0x8c/0xff
[310658.536253] apic_timer_interrupt+0xf/0x20
[310658.536254]
[310658.536258] RIP: 0010:fib6_walk_continue+0x37/0xe6
[310658.536260] Code: 02 0f 0b 48 8b 43 18 48 85 c0 0f 84 c5 00 00 00 8b 53 28 83 fa 01 74 1e 72 0c 83 fa 02 74 3c 83 fa 03 74 6b eb e1 48 8b 50 08 <48> 85 d2 75 10 c7 43 28 01 00 00 00 48 8b 50 10 48 85 d2 74 0d 48
[310658.536261] RSP: 0018:c90003583a20 EFLAGS: 0297 ORIG_RAX: ff13
[310658.536262] RAX: 5103b800 RBX: c90003583a58 RCX: 5103b800
[310658.536263] RDX: RSI: 2868c1c0 RDI: 257dc500
[310658.536264] RBP: 820d8f00 R08: c90003583b18 R09:
[310658.536265] R10: 07387eb0 R11: 07387eb0 R12: 820d9980
[310658.536266] R13: 817567ed R14: R15: c90003583b18
[310658.536268] ? call_fib6_entry_notifiers+0x59/0x59
[310658.536272] fib6_walk+0x59/0x76
[310658.536274] fib6_clean_tree+0x52/0x6c
[310658.536276] ? fib6_del+0x1da/0x1da
[310658.536278] ? call_fib6_entry_notifiers+0x59/0x59
[310658.536280] __fib6_clean_all+0x55/0x71
[310658.536282] fib6_run_gc+0x85/0xe6
[310658.536285] ip6_dst_gc+0x74/0xbf
[310658.536288] dst_alloc+0x70/0x84
[310658.536290] ip6_dst_alloc+0x1c/0x59
[310658.536293] icmp6_dst_alloc+0x39/0xd9
[310658.536295] ndisc_send_skb+0x8e/0x274
[310658.536298] ? __kmalloc_reserve.isra.43+0x28/0x6a
[310658.536300] ndisc_send_ns+0x135/0x15e
[310658.536302] ? ndisc_solicit+0xdd/0x106
[310658.536304] ndisc_solicit+0xdd/0x106
[310658.536306] ? lock_timer_base+0x3d/0x61
[310658.536308] ? neigh_table_init+0x1f9/0x1f9
[310658.536310] ? neigh_probe+0x44/0x55
[310658.536312] neigh_probe+0x44/0x55
[310658.536314] neigh_timer_handler+0x192/0x1ca
[310658.536316] call_timer_fn+0x51/0x125
[310658.536319] run_timer_softirq+0x13c/0x172
[310658.536322] ? __switch_to+0x16c/0x3be
[310658.536324] __do_softirq+0xec/0x273
[310658.536329] ? sort_range+0x17/0x17
[310658.536331] run_ksoftirqd+0x13/0x1b
[310658.536334] smpboot_thread_fn+0x123/0x138
[310658.536336] kthread+0xe5/0xea
[310658.536338] ? kthread_destroy_worker+0x39/0x39
[310658.536340] ret_from_fork+0x1f/0x30
[310658.938348] ixgbe :06:00.1 enp6s0f1: initiating reset due to tx timeout
[310660.685477] ixgbe :84:00.0 enp132s0f0: initiating reset due to tx timeout
[310661.484424] ixgbe :04:00.1 enp4s0f1: initiating reset due to tx timeout
[310662.652232] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310663.620879] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310664.605672] ixgbe :06:00.1 enp6s0f1: initiating reset due to tx timeout
[310666.775352] ixgbe :84:00.0 enp132s0f0: initiating reset due to tx timeout
[310667.565003] ixgbe :04:00.1 enp4s0f1: initiating reset due to tx timeout
[310668.349902] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310669.534595] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310670.311528] ixgbe :06:00.1 enp6s0f1: initiating reset due to tx timeout
[310673.682156] ixgbe :04:00.1 enp4s0f1: initiating reset due to tx timeout
[310673.876020] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310674.264810] ixgbe :06:00.0 enp6s0f0: initiating reset due to tx timeout
[310675.436519] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310679.393312] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310680.777320] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310684.782413] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310685.176739] ixgbe :04:00.1 enp4s0f1: initiating reset due to tx timeout
[310686.561306] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310690.345435] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx
Re: [Patch net] net: invert the check of detecting hardware RX checksum fault
On 16.11.2018 at 21:06, Cong Wang wrote:
On Thu, Nov 15, 2018 at 8:50 PM Herbert Xu wrote:
On Thu, Nov 15, 2018 at 06:23:38PM -0800, Cong Wang wrote:
Normally if the hardware's partial checksum is valid then we just trust it and send the packet along. However, if the partial checksum is invalid we don't trust it and we will compute the whole checksum manually, which is what ends up in sum.
Not sure if I understand partial checksum here, but it is the CHECKSUM_COMPLETE case which I am trying to fix, not CHECKSUM_PARTIAL.
What I meant by partial checksum is the checksum produced by the hardware on RX. In the kernel we call that CHECKSUM_COMPLETE. CHECKSUM_PARTIAL is the absence of the substantial part of the checksum, which is something we use in the kernel primarily for TX. Yes, the names are confusing :)
Yeah, understood. The hardware provides skb->csum in this case, but we keep adjusting it each time we change skb->data. So, in other words, a checksum *match* is what is intended to detect this HW RX checksum fault?
Correct. Or more likely it's a bug in either the driver or, if there is overlaying code such as VLAN, then in that code. Basically, if the RX checksum is buggy, it's much more likely to cause a valid packet to be rejected than to cause an invalid packet to be accepted, because we still verify that checksum against the pseudoheader. So we only attempt to catch buggy hardware/drivers by doing a second manual verification for the case where the packet is flagged as invalid.
Hmm, now I see how it works. Actually it uses the difference between these two checks, i.e. the difference between the hardware checksum and skb_checksum(). I will send a patch to add a comment there to avoid confusion.
Sure, my case is nearly the same as Pawel's, except I have no vlan: https://marc.info/?l=linux-netdev&m=154086647601721&w=2
Can you please provide your backtrace?
I already did: https://marc.info/?l=linux-netdev&m=154092211305599&w=2
Note, the offending commit has been backported to 4.14, which is why I saw this warning. I have no idea why it was backported in the first place; it is just an optimization and doesn't fix any bug, IMHO. Also, it is much harder for me to reproduce than for Pawel, who saw the warning every second. Sometimes I need 1 hour to trigger it; sometimes other people here need 10+ hours to trigger it.
By the way - I changed the network controller for the vlans where I was receiving rx csum failures to an 82599 with the ixgbe driver; with mellanox:
[91584.359273] vlan980: hw csum failure
[91584.359278] CPU: 54 PID: 0 Comm: swapper/54 Not tainted 4.20.0-rc1+ #2
[91584.359279] Call Trace:
[91584.359282]
[91584.359290] dump_stack+0x46/0x5b
[91584.359296] __skb_checksum_complete+0x9b/0xb0
[91584.359301] icmp_rcv+0x51/0x1f0
[91584.359305] ip_local_deliver_finish+0x49/0xd0
[91584.359307] ip_local_deliver+0xb7/0xe0
[91584.359309] ? ip_sublist_rcv_finish+0x50/0x50
[91584.359310] ip_rcv+0x96/0xc0
[91584.359313] __netif_receive_skb_one_core+0x4b/0x70
[91584.359315] netif_receive_skb_internal+0x2f/0xc0
[91584.359316] napi_gro_receive+0xb0/0xd0
[91584.359320] mlx5e_handle_rx_cqe+0x78/0xd0
[91584.359321] mlx5e_poll_rx_cq+0xc4/0x970
[91584.359323] mlx5e_napi_poll+0xab/0xcb0
[91584.359325] net_rx_action+0xd9/0x300
[91584.359328] __do_softirq+0xd3/0x2d9
[91584.359333] irq_exit+0x7a/0x80
[91584.359334] do_IRQ+0x72/0xc0
[91584.359336] common_interrupt+0xf/0xf
[91584.359337]
[91584.359340] RIP: 0010:mwait_idle+0x74/0x1b0
[91584.359342] Code: ae f0 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 48 c1 e8 03 83 e0 01 0f 85 26 01 00 00 48 89 c1 fb 0f 01 c9 <65> 8b 2d 95 8e 6b 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[91584.359343] RSP: 0018:c900034f3ec0 EFLAGS: 0246 ORIG_RAX: ffde
[91584.359344] RAX: RBX: 0036 RCX:
[91584.359345] RDX: RSI: RDI:
[91584.359346] RBP: 0036 R08: R09:
[91584.359346] R10: 0001008b49bb R11: 0c00 R12:
[91584.359347] R13: R14: R15:
[91584.359352] do_idle+0x19f/0x1c0
[91584.359354] ? do_idle+0x4/0x1c0
[91584.359355] cpu_startup_entry+0x14/0x20
[91584.359360] start_secondary+0x165/0x190
[91584.359364] secondary_startup_64+0xa4/0xb0
With intel, no errors.
Let me see if I can add a vlan on my side to make it more reproducible; it seems hard, as our switch doesn't use vlan either. We have warnings with conntrack involved too; I can provide those as well if you are interested. I tend to revert it for -stable; at least that is what I plan to do on my side unless there is a fix coming soon.
Thanks.
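The arithmetic behind the discussion above can be illustrated with a small userspace sketch (ones' complement sums only; this is not the kernel's __skb_checksum_complete() code): the NIC reports a ones' complement sum over the packet (CHECKSUM_COMPLETE), and when the software recomputation over the same bytes disagrees with what the hardware delivered, the "hw csum failure" warning is the right outcome.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* 16-bit ones' complement sum, the same arithmetic the Internet checksum uses. */
static uint32_t csum_add(uint32_t sum, const uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

int main(void)
{
	uint8_t payload[] = { 0x08, 0x00, 0x42, 0x00, 0x12, 0x34, 0x00, 0x01 };

	/* Sum as a NIC doing CHECKSUM_COMPLETE would hand it over in skb->csum. */
	uint32_t hw_csum = csum_add(0, payload, sizeof(payload));
	/* Software recomputation over the same bytes (what skb_checksum() does). */
	uint32_t sw_csum = csum_add(0, payload, sizeof(payload));
	/* Simulate a buggy NIC/driver by corrupting the reported sum. */
	uint32_t buggy_hw_csum = hw_csum ^ 0x0100;

	printf("hw=%04x sw=%04x -> %s\n", (unsigned)hw_csum, (unsigned)sw_csum,
	       hw_csum == sw_csum ? "ok" : "hw csum failure");
	printf("hw=%04x sw=%04x -> %s\n", (unsigned)buggy_hw_csum, (unsigned)sw_csum,
	       buggy_hw_csum == sw_csum ? "ok" : "hw csum failure");
	return 0;
}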
Re: consistency for statistics with XDP mode
On 21.11.2018 at 22:14, Toke Høiland-Jørgensen wrote:
David Ahern writes:
Paweł ran some more XDP tests yesterday and from them found a couple of issues. One is a panic in the mlx5 driver unloading the bpf program (mlx5e_xdp_xmit); he will send a separate email for that problem.
Same as this one, I guess? https://marc.info/?l=linux-netdev&m=153855905619717&w=2
Yes, same as this one. When there is no traffic (for example with the xdp_fwd program loaded), or there is not much traffic, like 1k frames per second of icmp, I can load/unload without crashing the kernel. But when I push tests with pktgen and use more than 50k pps of udp, then unbinding the xdp_fwd program makes the kernel panic :)
The problem I wanted to discuss here is statistics for XDP context. The short of it is that we need consistency in the counters across NIC drivers and virtual devices. Right now stats are specific to a driver with no clear accounting for the packets and bytes handled in XDP.
For example virtio has some stats as device private data extracted via ethtool:
$ ethtool -S eth2 | grep xdp
...
rx_queue_3_xdp_packets: 5291
rx_queue_3_xdp_tx: 0
rx_queue_3_xdp_redirects: 5163
rx_queue_3_xdp_drops: 0
...
tx_queue_3_xdp_tx: 5163
tx_queue_3_xdp_tx_drops: 0
And the standard counters appear to track bytes and packets for Rx, but not Tx if the packet is forwarded in XDP.
Similarly, mlx5 has some counters (thanks to Jesper and Toke for helping out here):
$ ethtool -S mlx5p1 | grep xdp
rx_xdp_drop: 86468350180
rx_xdp_redirect: 18860584
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
...
rx3_xdp_drop: 86468350180
rx3_xdp_redirect: 18860556
rx3_xdp_tx_xmit: 0
rx3_xdp_tx_full: 0
rx3_xdp_tx_err: 0
rx3_xdp_tx_cqes: 0
...
tx0_xdp_xmit: 0
tx0_xdp_full: 0
tx0_xdp_err: 0
tx0_xdp_cqes: 0
...
And no accounting in the standard stats for packets handled in XDP.
And then, if I understand Jesper's data correctly, the i40e driver does not have device specific data:
$ ethtool -S i40e1 | grep xdp
[NOTHING]
But rather bumps the standard counters:
sudo ./xdp_rxq_info --dev i40e1 --action XDP_DROP
Running XDP on dev:i40e1 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats        CPU      pps          issue-pps
XDP-RX CPU       1        36,156,872   0
XDP-RX CPU       total    36,156,872
RXQ stats        RXQ:CPU  pps          issue-pps
rx_queue_index1:1         36,156,878   0
rx_queue_index1:sum       36,156,878
$ ethtool_stats.pl --dev i40e1
Show adapter(s) (i40e1) statistics (ONLY that changed!)
Ethtool(i40e1 ) stat: 2711292859 ( 2,711,292,859) <= port.rx_bytes /sec
Ethtool(i40e1 ) stat: 6274204 ( 6,274,204) <= port.rx_dropped /sec
Ethtool(i40e1 ) stat: 42363867 ( 42,363,867) <= port.rx_size_64 /sec
Ethtool(i40e1 ) stat: 42363950 ( 42,363,950) <= port.rx_unicast /sec
Ethtool(i40e1 ) stat: 2165051990 ( 2,165,051,990) <= rx-1.bytes /sec
Ethtool(i40e1 ) stat: 36084200 ( 36,084,200) <= rx-1.packets /sec
Ethtool(i40e1 ) stat: 5385 ( 5,385) <= rx_dropped /sec
Ethtool(i40e1 ) stat: 36089727 ( 36,089,727) <= rx_unicast /sec
We really need consistency in the counters and, at a minimum, users should be able to track packet and byte counters for both Rx and Tx, including XDP.
It seems to me the Rx and Tx packet, byte and dropped counters returned for the standard device stats (/proc/net/dev, ip -s li show, ...) should include all packets managed by the driver regardless of whether they are forwarded / dropped in XDP or go up the Linux stack. This also aligns with mlxsw and the stats it shows, which are packets handled by the hardware.
From there the private stats can include XDP specifics as desired -- like the drops and redirects -- but those should be add-ons, and even here some consistency makes life easier for users. The same standards should also be applied to virtual devices built on top of the ports -- e.g., vlans. I have an API now that allows bumping stats for vlan devices. Keeping the basic xdp packets in the standard counters allows Paweł, for example, to continue to monitor /proc/net/dev.
Can we get agreement on this? And from there, get updates to the mlx5 and virtio drivers?
I'd say it sounds reasonable to include XDP in the normal traffic counters, but having the detailed XDP-specific counters is quite useful as well... So can't we do both (for all drivers)?
-Toke
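To make the accounting question concrete, the usual driver-independent pattern for XDP-level counters is a per-CPU map that the program bumps on every frame; a generic sketch using libbpf conventions, not the code from David's branch and not what any driver exposes today:

/* xdp_stats_kern.c - sketch: count packets/bytes seen by an XDP program */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct datarec {
	__u64 packets;
	__u64 bytes;
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct datarec);
} xdp_stats_map SEC(".maps");

SEC("xdp")
int xdp_stats(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	__u32 key = 0;
	struct datarec *rec;

	rec = bpf_map_lookup_elem(&xdp_stats_map, &key);
	if (rec) {
		/* No locking needed: the map is per-CPU. */
		rec->packets++;
		rec->bytes += (__u64)(data_end - data);
	}
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

The open question in this thread is whether the equivalent packet and byte counts should also land in the standard netdev counters, rather than live only in program-private maps like this one.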
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 19.11.2018 at 22:59, David Ahern wrote:
On 11/9/18 5:06 PM, David Ahern wrote:
On 11/9/18 9:21 AM, David Ahern wrote:
Is it possible to add only counters from xdp for vlans? This will help me in testing.
I will take a look today at adding counters that you can dump using bpftool. It will be a temporary solution for this xdp program only.
Same tree, kernel-tables-wip-02 branch. Compile kernel and install. Compile samples as before.
new version: https://github.com/dsahern/linux.git bpf/kernel-tables-wip-03
This one prototypes incrementing counters for VLAN devices (rx/tx, packets and bytes). Counters for netdevices representing physical ports should be managed by the NIC driver.
Will test it today. Thanks
Paweł
I will look at what can be done for packet captures (e.g., xdpdump and https://github.com/facebookincubator/katran/tree/master/tools). Most likely a project for next week.
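For completeness, counters kept in a pinned per-CPU BPF map can also be read by a small C program instead of bpftool; a sketch assuming a map pinned at /sys/fs/bpf/xdp_stats_map with the packets/bytes record from the earlier example (the path and struct names are illustrative, not from the kernel-tables branch):

/* xdp_stats_user.c - sketch: sum a per-CPU stats map across CPUs */
#include <stdio.h>
#include <stdlib.h>
#include <linux/types.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

struct datarec {
	__u64 packets;
	__u64 bytes;
};

int main(void)
{
	int ncpus = libbpf_num_possible_cpus();
	int fd = bpf_obj_get("/sys/fs/bpf/xdp_stats_map"); /* assumed pin path */
	struct datarec *vals;
	__u64 pkts = 0, bytes = 0;
	__u32 key = 0;
	int i;

	if (fd < 0 || ncpus < 1) {
		perror("bpf_obj_get");
		return 1;
	}
	vals = calloc(ncpus, sizeof(*vals));
	if (!vals)
		return 1;
	/* A PERCPU_ARRAY lookup returns one value per possible CPU. */
	if (bpf_map_lookup_elem(fd, &key, vals)) {
		perror("bpf_map_lookup_elem");
		return 1;
	}
	for (i = 0; i < ncpus; i++) {
		pkts += vals[i].packets;
		bytes += vals[i].bytes;
	}
	printf("packets: %llu bytes: %llu\n",
	       (unsigned long long)pkts, (unsigned long long)bytes);
	free(vals);
	return 0;
}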
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 11.11.2018 at 09:56, Jesper Dangaard Brouer wrote:
On Sat, 10 Nov 2018 22:53:53 +0100 Paweł Staszewski wrote:
Now im messing with ring configuration for connectx5 nics. And after reading that paper: https://netdevconf.org/2.1/slides/apr6/network-performance/04-amir-RX_and_TX_bulking_v2.pdf
Do notice that some of the ideas in that slide deck were never implemented. But they are still on my todo list ;-).
Notice how it shows that TX bulking is very important, but based on your ethtool_stats.pl I can see that not much TX bulking is happening in your case. This is indicated via the xmit_more counters:
Ethtool(enp175s0) stat: 2630 ( 2,630) <= tx_xmit_more /sec
Ethtool(enp175s0) stat: 4956995 ( 4,956,995) <= tx_packets /sec
And the per queue levels are also available:
Ethtool(enp175s0) stat: 184845 ( 184,845) <= tx7_packets /sec
Ethtool(enp175s0) stat: 78 ( 78) <= tx7_xmit_more /sec
This means that you are doing too many doorbells to the NIC hardware at TX time, which I worry could be what causes the NIC and PCIe hardware not to operate at optimal speeds.
After tuning coal/ring a little with ethtool, reached today:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
iface          Rx           Tx           Total
==
enp175s0:      50.68 Gb/s   21.53 Gb/s   72.20 Gb/s
enp216s0:      21.62 Gb/s   50.81 Gb/s   72.42 Gb/s
--
total:         72.30 Gb/s   72.33 Gb/s  144.63 Gb/s
And still no packet loss (icmp side-to-side test every 100ms).
Below perf top:
PerfTop: 104692 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cycles], (all, 56 CPUs)
---
 9.06% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
 6.43% [kernel] [k] tasklet_action_common.isra.21
 5.68% [kernel] [k] fib_table_lookup
 4.89% [kernel] [k] irq_entries_start
 4.53% [kernel] [k] mlx5_eq_int
 4.10% [kernel] [k] build_skb
 3.39% [kernel] [k] mlx5e_poll_tx_cq
 3.38% [kernel] [k] mlx5e_sq_xmit
 2.73% [kernel] [k] mlx5e_poll_rx_cq
 2.18% [kernel] [k] __dev_queue_xmit
 2.13% [kernel] [k] vlan_do_receive
 2.12% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
 2.00% [kernel] [k] ip_finish_output2
 1.87% [kernel] [k] mlx5e_post_rx_mpwqes
 1.86% [kernel] [k] memcpy_erms
 1.85% [kernel] [k] ipt_do_table
 1.70% [kernel] [k] dev_gro_receive
 1.39% [kernel] [k] __netif_receive_skb_core
 1.31% [kernel] [k] inet_gro_receive
 1.21% [kernel] [k] ip_route_input_rcu
 1.21% [kernel] [k] tcp_gro_receive
 1.13% [kernel] [k] _raw_spin_lock
 1.08% [kernel] [k] __build_skb
 1.06% [kernel] [k] kmem_cache_free_bulk
 1.05% [kernel] [k] __softirqentry_text_start
 1.03% [kernel] [k] vlan_dev_hard_start_xmit
 0.98% [kernel] [k] pfifo_fast_dequeue
 0.95% [kernel] [k] mlx5e_xmit
 0.95% [kernel] [k] page_frag_free
 0.88% [kernel] [k] ip_forward
 0.81% [kernel] [k] dev_hard_start_xmit
 0.78% [kernel] [k] rcu_irq_exit
 0.77% [kernel] [k] netif_skb_features
 0.72% [kernel] [k] napi_complete_done
 0.72% [kernel] [k] kmem_cache_alloc
 0.68% [kernel] [k] validate_xmit_skb.isra.142
 0.66% [kernel] [k] ip_rcv_core.isra.20.constprop.25
 0.58% [kernel] [k] swiotlb_map_page
 0.57% [kernel] [k] __qdisc_run
 0.56% [kernel] [k] tasklet_action
 0.54% [kernel] [k] __get_xps_queue_idx
 0.54% [kernel] [k] inet_lookup_ifaddr_rcu
 0.50% [kernel] [k] tcp4_gro_receive
 0.49% [kernel] [k] skb_release_data
 0.47% [kernel] [k] eth_type_trans
 0.40% [kernel] [k] sch_direct_xmit
 0.40% [kernel] [k] net_rx_action
 0.39% [kernel] [k] __local_bh_enable_ip
And perf record/report: https://ufile.io/zguq0
So now I know what was causing cpu load for some processes like:
 2913 root 20 0 0 0 0 I 10.3 0.0 6:58.29 kworker/u112:1-
    7 root 20 0 0 0 0 I  8.6 0.0 6:17.18 kworker/u112:0-
10289 root 20 0 0 0 0 I  6.6 0.0 6:33.90 kworker/u112:4-
 2939 root 20 0 0 0 0 R  3.6 0.0 7:37.68 kworker/u112:2-
After disabling adaptive tx
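The doorbell point can be illustrated with a toy model (plain userspace C, not mlx5e code): a driver posts descriptors and only writes the doorbell when the stack signals that nothing more is queued behind the current packet, so roughly 2.6k tx_xmit_more/sec against ~5M tx_packets/sec means close to one MMIO doorbell write per packet.

#include <stdbool.h>
#include <stdio.h>

static unsigned long doorbells;

/* Model of a driver xmit routine: post the descriptor, and only ring the
 * doorbell when the stack says nothing more is coming (xmit_more == false). */
static void xmit_one(bool xmit_more)
{
	/* ...post TX descriptor to the ring here... */
	if (!xmit_more)
		doorbells++;	/* one MMIO write telling HW to fetch new work */
}

static void send_burst(unsigned int pkts, unsigned int batch)
{
	unsigned int i;

	doorbells = 0;
	for (i = 0; i < pkts; i++)
		xmit_one((i + 1) % batch != 0 && i + 1 != pkts);
	printf("batch=%-3u packets=%u doorbells=%lu\n", batch, pkts, doorbells);
}

int main(void)
{
	send_burst(4956995, 1);		/* no bulking: ~1 doorbell per packet */
	send_burst(4956995, 32);	/* xmit_more batching: ~1 doorbell per 32 packets */
	return 0;
}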
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 11.11.2018 o 09:03, Jesper Dangaard Brouer pisze: On Sat, 10 Nov 2018 23:19:50 +0100 Paweł Staszewski wrote: W dniu 10.11.2018 o 23:06, Jesper Dangaard Brouer pisze: On Sat, 10 Nov 2018 20:56:02 +0100 Paweł Staszewski wrote: W dniu 10.11.2018 o 20:49, Paweł Staszewski pisze: W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski wrote: W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: [...] Do notice, the per CPU squeeze is not too large. Yes - but im searching invisible thing now :) something invisible is slowing down packet processing :) So trying to find any counter that have something to do with packet processing. NOTICE, I have given you the counters you need (below) Yes noticed this :) [...] Remember those tests are now on two separate connectx5 connected to two separate pcie x16 gen 3.0 That is strange... I still suspect some HW NIC issue, can you provide ethtool stats info via tool: https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl $ ethtool_stats.pl --dev enp175s0 --dev enp216s0 The tool remove zero-stats counters and report per sec stats. It makes it easier to spot that is relevant for the given workload. yes mlnx have just too many counters that are always 0 for my case :) Will try this also But still alot of non 0 counters Show adapter(s) (enp175s0 enp216s0) statistics (ONLY that changed!) Ethtool(enp175s0) stat: 8891 ( 8,891) <= ch0_arm /sec [...] I have copied the stats over in another document so I can better looks at it... and I've found some interesting stats. E.g. we can see that the NIC hardware is dropping packets. RX-drops on enp175s0: (enp175s0) stat: 4850734036 ( 4,850,734,036) <= rx_bytes /sec (enp175s0) stat: 5069043007 ( 5,069,043,007) <= rx_bytes_phy /sec -218308971 ( -218,308,971) Dropped bytes /sec (enp175s0) stat: 139602 ( 139,602) <= rx_discards_phy /sec (enp175s0) stat: 3717148 ( 3,717,148) <= rx_packets /sec (enp175s0) stat: 3862420 ( 3,862,420) <= rx_packets_phy /sec -145272 ( -145,272) Dropped packets /sec RX-drops on enp216s0 is less: (enp216s0) stat: 2592286809 ( 2,592,286,809) <= rx_bytes /sec (enp216s0) stat: 2633575771 ( 2,633,575,771) <= rx_bytes_phy /sec -41288962 ( -41,288,962) Dropped bytes /sec (enp216s0) stat: 464 (464) <= rx_discards_phy /sec (enp216s0) stat: 4971677 ( 4,971,677) <= rx_packets /sec (enp216s0) stat: 4975563 ( 4,975,563) <= rx_packets_phy /sec -3886 (-3,886) Dropped packets /sec I would recommend, that you use ethtool stats and monitor rx_discards_phy. The PHY are the counters from the hardware, and it shows that packets are getting dropped at HW level. This can be because software is not fast enough to empty RX-queue, but in this case where CPUs are mostly idle I don't think that is the case. That is why i was searching some counter for software - where is something wrong. 
Cause in earlier reports from ethtool there was also phy drops reported - just when cpu's was saturated that was normal for me that phy can drop packets if no more cpu cycles available to pickup them from hw But in case where i have 50% idle cpu's - there should be no problem - that is why i start to modify ethtool params for tx/rx ring and coalescence Currently waiting for more traffic with new ethtool settings: ethtool -g enp175s0 Ring parameters for enp175s0: Pre-set maximums: RX: 8192 RX Mini: 0 RX Jumbo: 0 TX: 8192 Current hardware settings: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 128 ethtool -c enp175s0 Coalesce parameters for enp175s0: Adaptive RX: off TX: on stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 dmac: 32517 rx-usecs: 64 rx-frames: 128 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 8 tx-frames: 128 tx-usecs-irq: 0 tx-frames-irq: 0 rx-usecs-low: 0 rx-frame-low: 0 tx-usecs-low: 0 tx-frame-low: 0 rx-usecs-high: 0 rx-frame-high: 0 tx-usecs-high: 0 tx-frame-high: 0 Both ports same settings. Current traffic: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate | iface Rx Tx Total == enp175s0: 37.85 Gb/s 7.77 Gb/s 45.62 Gb/s enp216s0: 7.80 Gb/s 37.90 Gb/s 45.70 Gb/s -- total: 45.61 Gb/s 45.63 Gb/s 91.24 Gb/s and mpstat for cpu's Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: all
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 10.11.2018 o 23:06, Jesper Dangaard Brouer pisze: On Sat, 10 Nov 2018 20:56:02 +0100 Paweł Staszewski wrote: W dniu 10.11.2018 o 20:49, Paweł Staszewski pisze: W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski wrote: W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: CPU load is lower than for connectx4 - but it looks like bandwidth limit is the same :) But also after reaching 60Gbit/60Gbit bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total === enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s --- total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s Today reached 65/65Gbit/s But starting from 60Gbit/s RX / 60Gbit TX nics start to drop packets (with 50%CPU on all 28cores) - so still there is cpu power to use :). This is weird! How do you see / measure these drops? Simple icmp test like ping -i 0.1 And im testing by icmp management ip address on vlan that is attacked to one NIC (the side that is more stressed with RX) And another icmp test is forward thru this router - host behind it Both measurements shows same loss ratio from 0.1 to 0.5% after reaching ~45Gbit/s RX side - depends how much RX side is pushed drops vary between 0.1 to 0.5 - even 0.6%:) Okay good to know, you use an external measurement for this. I do think packets are getting dropped by the NIC. So checked other stats. softnet_stats shows average 1k squeezed per sec: Is below output the raw counters? not per sec? It would be valuable to see the per sec stats instead... I use this tool: https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl CPU total/sec dropped/sec squeezed/sec collision/sec rx_rps/sec flow_limit/sec CPU:00 0 0 0 0 0 0 [...] CPU:13 0 0 0 0 0 0 CPU:14 485538 0 43 0 0 0 CPU:15 474794 0 51 0 0 0 CPU:16 449322 0 41 0 0 0 CPU:17 476420 0 46 0 0 0 CPU:18 440436 0 38 0 0 0 CPU:19 501499 0 49 0 0 0 CPU:20 459468 0 49 0 0 0 CPU:21 438928 0 47 0 0 0 CPU:22 468983 0 40 0 0 0 CPU:23 446253 0 47 0 0 0 CPU:24 451909 0 46 0 0 0 CPU:25 479373 0 55 0 0 0 CPU:26 467848 0 49 0 0 0 CPU:27 453153 0 51 0 0 0 CPU:28 0 0 0 0 0 0 [...] CPU:40 0 0 0 0 0 0 CPU:41 0 0 0 0 0 0 CPU:42 466853 0 43 0 0 0 CPU:43 453059 0 54 0 0 0 CPU:44 363219 0 34 0 0 0 CPU:45 353632 0 38 0 0 0 CPU:46 371618 0 40 0 0 0 CPU:47 350518 0 46 0 0 0 CPU:48 397544 0 40 0 0 0 CPU:49 364873 0 38 0 0 0 CPU:50 383630 0 38 0 0 0 CPU:51 358771 0 39 0 0 0 CPU:52 372547 0 38 0 0 0 CPU:53 372882 0 36 0 0 0 CPU:54 366244 0 43 0
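The per-second view that softnet_stat.pl produces can also be had from a few lines of C; a rough sketch, assuming the 4.x /proc/net/softnet_stat layout where the first three hex columns are packets processed, dropped, and time_squeeze:

/* softnet_delta.c - print per-CPU processed/dropped/squeezed per second */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 256

struct row { unsigned long processed, dropped, squeezed; };

static int read_rows(struct row *rows)
{
	FILE *f = fopen("/proc/net/softnet_stat", "r");
	char line[512];
	int n = 0;

	if (!f)
		return -1;
	while (n < MAX_CPUS && fgets(line, sizeof(line), f))
		if (sscanf(line, "%lx %lx %lx", &rows[n].processed,
			   &rows[n].dropped, &rows[n].squeezed) == 3)
			n++;
	fclose(f);
	return n;
}

int main(void)
{
	struct row prev[MAX_CPUS], cur[MAX_CPUS];
	int n = read_rows(prev), i;

	if (n <= 0)
		return 1;
	for (;;) {
		sleep(1);
		if (read_rows(cur) != n)
			return 1;
		for (i = 0; i < n; i++)
			if (cur[i].processed != prev[i].processed ||
			    cur[i].squeezed != prev[i].squeezed)
				printf("CPU:%02d total/s %lu dropped/s %lu squeezed/s %lu\n",
				       i, cur[i].processed - prev[i].processed,
				       cur[i].dropped - prev[i].dropped,
				       cur[i].squeezed - prev[i].squeezed);
		memcpy(prev, cur, sizeof(prev));
		puts("");
	}
}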
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 10.11.2018 o 22:53, Paweł Staszewski pisze: W dniu 10.11.2018 o 22:01, Jesper Dangaard Brouer pisze: On Sat, 10 Nov 2018 21:02:10 +0100 Paweł Staszewski wrote: W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: I want you to experiment with: ethtool --set-priv-flags DEVICE rx_striding_rq off just checked that previously connectx4 was have thos disabled: ethtool --show-priv-flags enp175s0f0 Private flags for enp175s0f0: rx_cqe_moder : on tx_cqe_moder : off rx_cqe_compress : off rx_striding_rq : off rx_no_csum_complete: off The CX4 hardware does not have this feature (p.s. the CX4-Lx does). So now we are on connectx5 and we have enabled - for sure connectx5 changed cpu load - where i have now max 50/60% cpu where with connectx4 there was sometimes near 100% with same configuration. I (strongly) believe the CPU load was related to the page-alloactor lock congestion, that Aaron fixed. Yes i think both - most problems with cpu was due to page-allocator problems. But also after change connctx4 to connectx5 there is cpu load difference - about 10% in total - but yes most of this like 40% is cause of Aaron patch :) - rly good job :) Now im messing with ring configuration for connectx5 nics. And after reading that paper: https://netdevconf.org/2.1/slides/apr6/network-performance/ 04-amir-RX_and_TX_bulking_v2.pdf changed from RX:8192 / TX: 4096 to RX:8192 / TX: 256 after this i gain about 5Gbit/s RX and TX traffic and less cpu load before change there was 59/59 Gbit/s After change there is 64/64 Gbit/s bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate | iface Rx Tx Total == enp175s0: 44.45 Gb/s 19.69 Gb/s 64.14 Gb/s enp216s0: 19.69 Gb/s 44.49 Gb/s 64.19 Gb/s -- total: 64.14 Gb/s 64.18 Gb/s 128.33 Gb/s Also after this change kernel freed some memory... like 500MB Still squeezed but less with more traffic... CPU total/sec dropped/sec squeezed/sec collision/sec rx_rps/sec flow_limit/sec CPU:00 0 0 0 0 0 0 CPU:01 0 0 0 0 0 0 CPU:02 0 0 0 0 0 0 CPU:03 0 0 0 0 0 0 CPU:04 0 0 0 0 0 0 CPU:05 0 0 0 0 0 0 CPU:06 0 0 0 0 0 0 CPU:07 0 0 0 0 0 0 CPU:08 0 0 0 0 0 0 CPU:09 0 0 0 0 0 0 CPU:10 0 0 0 0 0 0 CPU:11 0 0 0 0 0 0 CPU:12 0 0 0 0 0 0 CPU:13 0 0 0 0 0 0 CPU:14 389270 0 41 0 0 0 CPU:15 375543 0 32 0 0 0 CPU:16 385847 0 22 0 0 0 CPU:17 412293 0 34 0 0 0 CPU:18 401287 0 30 0 0 0 CPU:19 368345 0 30 0 0 0 CPU:20 395452 0 28 0 0 0 CPU:21 374032 0 38 0 0 0 CPU:22 342036 0 32 0 0 0 CPU:23 374773 0 34 0 0 0 CPU:24 356139 0 31 0 0 0 CPU:25 392725 0 32 0 0 0 CPU:26 385937 0 37 0 0 0 CPU:27 385282 0 37 0 0 0 CPU:28 0 0 0 0 0 0 CPU:29
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 10.11.2018 at 22:01, Jesper Dangaard Brouer wrote:
On Sat, 10 Nov 2018 21:02:10 +0100 Paweł Staszewski wrote:
On 10.11.2018 at 20:34, Jesper Dangaard Brouer wrote:
I want you to experiment with: ethtool --set-priv-flags DEVICE rx_striding_rq off
just checked that the previous connectx4 had those disabled:
ethtool --show-priv-flags enp175s0f0
Private flags for enp175s0f0:
rx_cqe_moder       : on
tx_cqe_moder       : off
rx_cqe_compress    : off
rx_striding_rq     : off
rx_no_csum_complete: off
The CX4 hardware does not have this feature (p.s. the CX4-Lx does).
So now we are on connectx5 and we have it enabled - for sure connectx5 changed cpu load - where I now have max 50/60% cpu, where with connectx4 there was sometimes near 100% with the same configuration.
I (strongly) believe the CPU load was related to the page-allocator lock congestion, that Aaron fixed.
Yes, I think both - most problems with cpu were due to page-allocator problems. But also after the change from connectx4 to connectx5 there is a cpu load difference - about 10% in total - but yes, most of this, like 40%, is because of Aaron's patch :) - really good job :)
Now im messing with ring configuration for connectx5 nics. And after reading that paper: https://netdevconf.org/2.1/slides/apr6/network-performance/04-amir-RX_and_TX_bulking_v2.pdf
changed from RX:8192 / TX:4096 to RX:8192 / TX:256
after this i gain about 5Gbit/s RX and TX traffic and less cpu load
before the change there was 59/59 Gbit/s
After the change there is 64/64 Gbit/s
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
iface          Rx           Tx           Total
==
enp175s0:      44.45 Gb/s   19.69 Gb/s   64.14 Gb/s
enp216s0:      19.69 Gb/s   44.49 Gb/s   64.19 Gb/s
--
total:         64.14 Gb/s   64.18 Gb/s  128.33 Gb/s
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 10.11.2018 at 20:34, Jesper Dangaard Brouer wrote:
I want you to experiment with: ethtool --set-priv-flags DEVICE rx_striding_rq off
just checked that the previous connectx4 had those disabled:
ethtool --show-priv-flags enp175s0f0
Private flags for enp175s0f0:
rx_cqe_moder       : on
tx_cqe_moder       : off
rx_cqe_compress    : off
rx_striding_rq     : off
rx_no_csum_complete: off
So now we are on connectx5 and we have it enabled - for sure connectx5 changed cpu load - where I now have max 50/60% cpu, where with connectx4 there was sometimes near 100% with the same configuration.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 10.11.2018 o 20:49, Paweł Staszewski pisze: W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski wrote: W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: CPU load is lower than for connectx4 - but it looks like bandwidth limit is the same :) But also after reaching 60Gbit/60Gbit bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s -- total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s Today reached 65/65Gbit/s But starting from 60Gbit/s RX / 60Gbit TX nics start to drop packets (with 50%CPU on all 28cores) - so still there is cpu power to use :). This is weird! How do you see / measure these drops? Simple icmp test like ping -i 0.1 And im testing by icmp management ip address on vlan that is attacked to one NIC (the side that is more stressed with RX) And another icmp test is forward thru this router - host behind it Both measurements shows same loss ratio from 0.1 to 0.5% after reaching ~45Gbit/s RX side - depends how much RX side is pushed drops vary between 0.1 to 0.5 - even 0.6%:) So checked other stats. softnet_stats shows average 1k squeezed per sec: Is below output the raw counters? not per sec? It would be valuable to see the per sec stats instead... I use this tool: https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl CPU total/sec dropped/sec squeezed/sec collision/sec rx_rps/sec flow_limit/sec CPU:00 0 0 0 0 0 0 CPU:01 0 0 0 0 0 0 CPU:02 0 0 0 0 0 0 CPU:03 0 0 0 0 0 0 CPU:04 0 0 0 0 0 0 CPU:05 0 0 0 0 0 0 CPU:06 0 0 0 0 0 0 CPU:07 0 0 0 0 0 0 CPU:08 0 0 0 0 0 0 CPU:09 0 0 0 0 0 0 CPU:10 0 0 0 0 0 0 CPU:11 0 0 0 0 0 0 CPU:12 0 0 0 0 0 0 CPU:13 0 0 0 0 0 0 CPU:14 485538 0 43 0 0 0 CPU:15 474794 0 51 0 0 0 CPU:16 449322 0 41 0 0 0 CPU:17 476420 0 46 0 0 0 CPU:18 440436 0 38 0 0 0 CPU:19 501499 0 49 0 0 0 CPU:20 459468 0 49 0 0 0 CPU:21 438928 0 47 0 0 0 CPU:22 468983 0 40 0 0 0 CPU:23 446253 0 47 0 0 0 CPU:24 451909 0 46 0 0 0 CPU:25 479373 0 55 0 0 0 CPU:26 467848 0 49 0 0 0 CPU:27 453153 0 51 0 0 0 CPU:28 0 0 0 0 0 0 CPU:29 0 0 0 0 0 0 CPU:30 0 0 0 0 0 0 CPU:31 0 0 0 0 0 0 CPU:32 0 0 0 0 0 0 CPU:33 0 0 0 0 0 0 CPU:34 0
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski wrote: W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: CPU load is lower than for connectx4 - but it looks like bandwidth limit is the same :) But also after reaching 60Gbit/60Gbit bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s -- total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s Today reached 65/65Gbit/s But starting from 60Gbit/s RX / 60Gbit TX nics start to drop packets (with 50%CPU on all 28cores) - so still there is cpu power to use :). This is weird! How do you see / measure these drops? Simple icmp test like ping -i 0.1 And im testing by icmp management ip address on vlan that is attacked to one NIC (the side that is more stressed with RX) And another icmp test is forward thru this router - host behind it Both measurements shows same loss ratio from 0.1 to 0.5% after reaching ~45Gbit/s RX side - depends how much RX side is pushed drops vary between 0.1 to 0.5 - even 0.6%:) So checked other stats. softnet_stats shows average 1k squeezed per sec: Is below output the raw counters? not per sec? It would be valuable to see the per sec stats instead... I use this tool: https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl cpu total dropped squeezed collision rps flow_limit 0 18554 0 1 0 0 0 1 16728 0 1 0 0 0 2 18033 0 1 0 0 0 3 17757 0 1 0 0 0 4 18861 0 0 0 0 0 5 0 0 1 0 0 0 6 2 0 1 0 0 0 7 0 0 1 0 0 0 8 0 0 0 0 0 0 9 0 0 1 0 0 0 10 0 0 0 0 0 0 11 0 0 1 0 0 0 12 50 0 1 0 0 0 13 257 0 0 0 0 0 14 3629115363 0 3353259 0 0 0 15 255167835 0 3138271 0 0 0 16 4240101961 0 3036130 0 0 0 17 599810018 0 3072169 0 0 0 18 432796524 0 3034191 0 0 0 19 41803906 0 3037405 0 0 0 20 900382666 0 3112294 0 0 0 21 620926085 0 3086009 0 0 0 22 41861198 0 3023142 0 0 0 23 4090425574 0 2990412 0 0 0 24 4264870218 0 3010272 0 0 0 25 141401811 0 3027153 0 0 0 26 104155188 0 3051251 0 0 0 27 4261258691 0 3039765 0 0 0 28 4 0 1 0 0 0 29 4 0 0 0 0 0 30 0 0 1 0 0 0 31 0 0 0 0 0 0 32 3 0 1 0 0 0 33 1 0 1 0 0 0 34 0 0 1 0 0 0 35 0 0 0 0 0 0 36 0 0 1 0 0 0 37 0 0 1 0 0 0 38 0 0 1 0 0 0 39 0 0 1 0 0 0 40 0 0 0 0 0 0 41 0 0 1 0 0 0 42 299758202 0 3139693 0 0 0 43 4254727979 0 3103577 0 0 0 44 195943 0 2554885 0 0 0 45 1675702723 0 2513481 0 0 0 46 1908435503 0 2519698 0 0 0 47 1877799710 0 2537768 0 0 0 48 2384274076 0 2584673 0 0 0 49 2598104878 0 2593616 0 0 0 50 1897566829
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 10.11.2018 at 01:06, David Ahern wrote:
On 11/9/18 9:21 AM, David Ahern wrote:
Is it possible to add only counters from xdp for vlans? This will help me in testing.
I will take a look today at adding counters that you can dump using bpftool. It will be a temporary solution for this xdp program only.
Same tree, kernel-tables-wip-02 branch. Compile kernel and install. Compile samples as before. If you give the userspace program a -t arg, it loops showing stats. Ctrl-C to break. The xdp programs are not detached on exit. Example:
./xdp_fwd -t 5 eth1 eth2 eth3 eth4
15:59:32: rx tx dropped skipped l3_dev fib_dev
index 3: 901158 9011580 18 0 0
index 4: 901159 9011580 20 0 901139
index 10:0 00019 19
index 11:0 000901139 901139
index 15:0 00019 19
index 16:0 000901139 0
Rx and Tx counters are for the physical port. VLANs show up as l3_dev (ingress) and fib_dev (egress). dropped is anytime the xdp program returns XDP_DROP (e.g., an invalid packet); skipped is anytime the program returns XDP_PASS (e.g., not ipv4 or ipv6, local traffic, or needs full stack assist).
recompiled the new version, but:
./xdp_fwd enp175s0f0 enp175s0f1
libbpf: failed to create map (name: 'stats_map'): Operation not permitted
libbpf: failed to load object './xdp_fwd_kern.o'
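One likely cause of that EPERM, though this is only a guess without seeing the loader: BPF map memory is charged against RLIMIT_MEMLOCK, and the common 64 KiB default makes libbpf map creation fail with "Operation not permitted". The bpf samples normally raise the limit before loading, roughly like this (or, equivalently, run with ulimit -l unlimited):

#include <stdio.h>
#include <sys/resource.h>

/* Raise RLIMIT_MEMLOCK before creating BPF maps; map memory is charged
 * against this limit and the default is easy to exceed. */
static int bump_memlock_rlimit(void)
{
	struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
		perror("setrlimit(RLIMIT_MEMLOCK)");
		return -1;
	}
	return 0;
}

int main(void)
{
	if (bump_memlock_rlimit())
		return 1;
	/* ...then open and load the BPF object with libbpf as usual... */
	puts("RLIMIT_MEMLOCK raised");
	return 0;
}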
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: CPU load is lower than for connectx4 - but it looks like bandwidth limit is the same :) But also after reaching 60Gbit/60Gbit bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s -- total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s Today reached 65/65Gbit/s But starting from 60Gbit/s RX / 60Gbit TX nics start to drop packets (with 50%CPU on all 28cores) - so still there is cpu power to use :). So checked other stats. softnet_stats shows average 1k squeezed per sec: cpu total dropped squeezed collision rps flow_limit 0 18554 0 1 0 0 0 1 16728 0 1 0 0 0 2 18033 0 1 0 0 0 3 17757 0 1 0 0 0 4 18861 0 0 0 0 0 5 0 0 1 0 0 0 6 2 0 1 0 0 0 7 0 0 1 0 0 0 8 0 0 0 0 0 0 9 0 0 1 0 0 0 10 0 0 0 0 0 0 11 0 0 1 0 0 0 12 50 0 1 0 0 0 13 257 0 0 0 0 0 14 3629115363 0 3353259 0 0 0 15 255167835 0 3138271 0 0 0 16 4240101961 0 3036130 0 0 0 17 599810018 0 3072169 0 0 0 18 432796524 0 3034191 0 0 0 19 41803906 0 3037405 0 0 0 20 900382666 0 3112294 0 0 0 21 620926085 0 3086009 0 0 0 22 41861198 0 3023142 0 0 0 23 4090425574 0 2990412 0 0 0 24 4264870218 0 3010272 0 0 0 25 141401811 0 3027153 0 0 0 26 104155188 0 3051251 0 0 0 27 4261258691 0 3039765 0 0 0 28 4 0 1 0 0 0 29 4 0 0 0 0 0 30 0 0 1 0 0 0 31 0 0 0 0 0 0 32 3 0 1 0 0 0 33 1 0 1 0 0 0 34 0 0 1 0 0 0 35 0 0 0 0 0 0 36 0 0 1 0 0 0 37 0 0 1 0 0 0 38 0 0 1 0 0 0 39 0 0 1 0 0 0 40 0 0 0 0 0 0 41 0 0 1 0 0 0 42 299758202 0 3139693 0 0 0 43 4254727979 0 3103577 0 0 0 44 195943 0 2554885 0 0 0 45 1675702723 0 2513481 0 0 0 46 1908435503 0 2519698 0 0 0 47 1877799710 0 2537768 0 0 0 48 2384274076 0 2584673 0 0 0 49 2598104878 0 2593616 0 0 0 50 1897566829 0 2530857 0 0 0 51 1712741629 0 2489089 0 0 0 52 1704033648 0 2495892 0 0 0 53 1636781820 0 2499783 0 0 0 54 1861997734 0 2541060 0 0 0 55 2113521616 0 2555673 0 0 0 So i rised netdev backlog and budged to rly high values 524288 for netdev_budget and same for backlog This rised sortirqs from about 600k/sec to 800k/sec for NET_TX/NET_RX But after this changes i have less packets drops. Below perf top from max traffic reached: PerfTop: 72230 irqs/sec kernel:99.4% exact: 0.0% [4000Hz cycles], (al
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 09.11.2018 o 17:21, David Ahern pisze: On 11/9/18 3:20 AM, Paweł Staszewski wrote: I just catch some weird behavior :) All was working fine for about 20k packets Then after xdp start to forward every 10 packets Interesting. Any counter showing drops? nothing that will fit NIC statistics: rx_packets: 187041 rx_bytes: 10600954 tx_packets: 40316 tx_bytes: 16526844 tx_tso_packets: 797 tx_tso_bytes: 3876084 tx_tso_inner_packets: 0 tx_tso_inner_bytes: 0 tx_added_vlan_packets: 38391 tx_nop: 2 rx_lro_packets: 0 rx_lro_bytes: 0 rx_ecn_mark: 0 rx_removed_vlan_packets: 187041 rx_csum_unnecessary: 0 rx_csum_none: 150011 rx_csum_complete: 37030 rx_csum_unnecessary_inner: 0 rx_xdp_drop: 0 rx_xdp_redirect: 64893 rx_xdp_tx_xmit: 0 rx_xdp_tx_full: 0 rx_xdp_tx_err: 0 rx_xdp_tx_cqe: 0 tx_csum_none: 2468 tx_csum_partial: 35955 tx_csum_partial_inner: 0 tx_queue_stopped: 0 tx_queue_dropped: 0 tx_xmit_more: 0 tx_recover: 0 tx_cqes: 38423 tx_queue_wake: 0 tx_udp_seg_rem: 0 tx_cqe_err: 0 tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 rx_wqe_err: 0 rx_mpwqe_filler_cqes: 0 rx_mpwqe_filler_strides: 0 rx_buff_alloc_err: 0 rx_cqe_compress_blks: 0 rx_cqe_compress_pkts: 0 rx_page_reuse: 0 rx_cache_reuse: 186302 rx_cache_full: 0 rx_cache_empty: 666768 rx_cache_busy: 174 rx_cache_waive: 0 rx_congst_umr: 0 rx_arfs_err: 0 ch_events: 249320 ch_poll: 249321 ch_arm: 249001 ch_aff_change: 0 ch_eq_rearm: 0 rx_out_of_buffer: 0 rx_if_down_packets: 57 rx_vport_unicast_packets: 142659 rx_vport_unicast_bytes: 42706914 tx_vport_unicast_packets: 40167 tx_vport_unicast_bytes: 16668096 rx_vport_multicast_packets: 39188170 rx_vport_multicast_bytes: 3466527450 tx_vport_multicast_packets: 58 tx_vport_multicast_bytes: 4556 rx_vport_broadcast_packets: 16343520 rx_vport_broadcast_bytes: 1031334602 tx_vport_broadcast_packets: 91 tx_vport_broadcast_bytes: 5460 rx_vport_rdma_unicast_packets: 0 rx_vport_rdma_unicast_bytes: 0 tx_vport_rdma_unicast_packets: 0 tx_vport_rdma_unicast_bytes: 0 rx_vport_rdma_multicast_packets: 0 rx_vport_rdma_multicast_bytes: 0 tx_vport_rdma_multicast_packets: 0 tx_vport_rdma_multicast_bytes: 0 tx_packets_phy: 40316 rx_packets_phy: 55674361 rx_crc_errors_phy: 0 tx_bytes_phy: 16839376 rx_bytes_phy: 4763267396 tx_multicast_phy: 58 tx_broadcast_phy: 91 rx_multicast_phy: 39188180 rx_broadcast_phy: 16343521 rx_in_range_len_errors_phy: 0 rx_out_of_range_len_phy: 0 rx_oversize_pkts_phy: 0 rx_symbol_err_phy: 0 tx_mac_control_phy: 0 rx_mac_control_phy: 0 rx_unsupported_op_phy: 0 rx_pause_ctrl_phy: 0 tx_pause_ctrl_phy: 0 rx_discards_phy: 1 tx_discards_phy: 0 tx_errors_phy: 0 rx_undersize_pkts_phy: 0 rx_fragments_phy: 0 rx_jabbers_phy: 0 rx_64_bytes_phy: 3792455 rx_65_to_127_bytes_phy: 51821620 rx_128_to_255_bytes_phy: 37669 rx_256_to_511_bytes_phy: 1481 rx_512_to_1023_bytes_phy: 434 rx_1024_to_1518_bytes_phy: 694 rx_1519_to_2047_bytes_phy: 20008 rx_2048_to_4095_bytes_phy: 0 rx_4096_to_8191_bytes_phy: 0 rx_8192_to_10239_bytes_phy: 0 link_down_events_phy: 0 rx_pcs_symbol_err_phy: 0 rx_corrected_bits_phy: 6 rx_err_lane_0_phy: 0 rx_err_lane_1_phy: 0 rx_err_lane_2_phy: 0 rx_err_lane_3_phy: 6 rx_buffer_passed_thres_phy: 0 rx_pci_signal_integrity: 0 tx_pci_signal_integrity: 82 outbound_pci_stalled_rd: 0 outbound_pci_stalled_wr: 0 outbound_pci_stalled_rd_events: 0 outbound_pci_stalled_wr_events: 0 rx_prio0_bytes: 4144920388 rx_prio0_packets: 48310037 tx_prio0_bytes: 16839376 tx_prio0_packets: 40316 rx_prio1_bytes: 481032 rx_prio1_packets: 7074 tx_prio1_bytes: 0 tx_prio1_packets: 0 rx_prio2_bytes: 9074194 
rx_prio2_packets: 106207 tx_prio2_bytes: 0 tx_prio2_packets: 0 rx_prio3_bytes: 0 rx_prio3_packets: 0 tx_prio3_bytes: 0 tx_prio3_packets: 0 rx_prio4_bytes: 0 rx_prio4_packets: 0 tx_prio4_bytes: 0 tx_prio4_packets: 0 rx_prio5_bytes: 0 rx_prio5_packets: 0 tx_prio5_bytes: 0 tx_prio5_packets: 0 rx_prio6_bytes: 371961810 rx_prio6_packets: 4006281 tx_prio6_bytes: 0 tx_prio6_packets: 0 rx_prio7_bytes: 236830040 rx_prio7_packets: 3244761 tx_prio7_bytes: 0 tx_prio7_packets: 0 tx_pause_storm_warning_events : 0 tx_pause_storm_error_events: 0 module_unplug: 0 module_bus_stuck: 0 module_high_temp: 0 module_bad_shorted: 0 NIC
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 17:06, David Ahern pisze: On 11/8/18 6:33 AM, Paweł Staszewski wrote: W dniu 07.11.2018 o 22:06, David Ahern pisze: On 11/3/18 6:24 PM, Paweł Staszewski wrote: Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? no. just phy(mlnx)->vlans only config VLAN and non-VLAN (and a mix) seem to work ok. Patches are here: https://github.com/dsahern/linux.git bpf/kernel-tables-wip I got lazy with the vlan exports; right now it requires 8021q to be builtin (CONFIG_VLAN_8021Q=y) You can use the xdp_fwd sample: make O=kbuild -C samples/bpf -j 8 Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server and run: ./xdp_fwd e.g., in my testing I run: xdp_fwd eth1 eth2 eth3 eth4 All of the relevant forwarding ports need to be on the same command line. This version populates a second map to verify the egress port has XDP enabled. Installed today on some lab server with mellanox connectx4 And trying some simple static routing first - but after enabling xdp program - receiver is not receiving frames Route table is simple as possible for tests :) icmp ping test send from 192.168.22.237 to 172.16.0.2 - incomming packets on vlan 4081 ip r default via 192.168.22.236 dev vlan4081 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205 neigh table: ip neigh ls 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE and interfaces: 4: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 xdp/id:5 qdisc mq state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff inet 192.168.22.205/24 scope global vlan4081 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c890/64 scope link valid_lft forever preferred_lft forever 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet 172.16.0.1/30 scope global vlan1740 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever xdp program detached: Receiving side tcpdump: 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id 30227, seq 487, length 64 I can see icmp requests enabling xdp ./xdp_fwd enp175s0f1 enp175s0f0 4: enp175s0f0: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 5: enp175s0f1: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff What hardware is this? Start with: echo 1 > /sys/kernel/debug/tracing/events/xdp/enable cat /sys/kernel/debug/tracing/trace_pipe >From there, you can check the FIB lookups: sysctl -w kernel.perf_event_max_stack=16 perf record -e fib:* -a -g -- sleep 5 perf script I just catch some weird behavior :) All was working fine for about 20k packets Then after xdp start to forward every 10 packets ping 172.16.0.2 -i 0.1 PING 172.16.0.2 (172.16.0.2) 56(84) bytes of data. 64 bytes from 172.16.0.2: icmp_seq=1 ttl=64 time=5.12 ms 64 bytes from 172.16.0.2: icmp_seq=9 ttl=64 time=5.20 ms 64 bytes from 172.16.0.2: icmp_seq=19 ttl=64 time=4.85 ms 64 bytes from 172.16.0.2: icmp_seq=29 ttl=64 time=4.91 ms 64 bytes from 172.16.0.2: icmp_seq=38 ttl=64 time=4.85 ms 64 bytes from 172.16.0.2: icmp_seq=48 ttl=64 time=5.00 ms ^C --- 172.16.0.2 ping statistics --- 55 packets transmitted, 6 received, 89% packet loss, time 5655ms rtt min/avg/max/mdev = 4.850/4.992/5.203/0.145 ms And again after some time back to normal ping 172.1
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 09.11.2018 at 05:52, Saeed Mahameed wrote:
On Thu, 2018-11-08 at 17:42 -0700, David Ahern wrote:
On 11/8/18 5:40 PM, Paweł Staszewski wrote:
On 08.11.2018 at 17:32, David Ahern wrote:
On 11/8/18 9:27 AM, Paweł Staszewski wrote:
What hardware is this?
mellanox connectx 4
ethtool -i enp175s0f0
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_200101033)
expansion-rom-version:
bus-info: :af:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
ethtool -i enp175s0f1
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_200101033)
expansion-rom-version:
bus-info: :af:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
Start with:
echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
cat /sys/kernel/debug/tracing/trace_pipe
cat /sys/kernel/debug/tracing/trace_pipe
-0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
FIB lookup is good, the redirect is happening, but the mlx5 driver does not like it. I think the -6 is coming from the mlx5 driver and the packet is getting dropped. Perhaps this check in mlx5e_xdp_xmit:
if (unlikely(sq_num >= priv->channels.num)) return -ENXIO;
I removed that part and recompiled - but after running xdp_fwd I now have a kernel panic :)
hh, no please don't do such a thing :)
yes - a dirty "try" :) Code back in place :)
It must be because the tx netdev has fewer tx queues than the rx netdev, or the rx netdev rings are bound to high cpu indexes. Anyway, best practice is to open #cores RX/TX queues on both sides:
ethtool -L enp175s0f0 combined $(nproc)
ethtool -L enp175s0f1 combined $(nproc)
Ok, now it is working. Time for some tests :) Thanks
Jesper or one of the Mellanox folks needs to respond about the config needed to run XDP with this NIC. I don't have a 40G or 100G card to play with.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 08.11.2018 at 17:32, David Ahern wrote:
On 11/8/18 9:27 AM, Paweł Staszewski wrote:
What hardware is this?
mellanox connectx 4
ethtool -i enp175s0f0
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_200101033)
expansion-rom-version:
bus-info: :af:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
ethtool -i enp175s0f1
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_200101033)
expansion-rom-version:
bus-info: :af:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
Start with:
echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
cat /sys/kernel/debug/tracing/trace_pipe
cat /sys/kernel/debug/tracing/trace_pipe
-0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
FIB lookup is good, the redirect is happening, but the mlx5 driver does not like it. I think the -6 is coming from the mlx5 driver and the packet is getting dropped. Perhaps this check in mlx5e_xdp_xmit:
if (unlikely(sq_num >= priv->channels.num)) return -ENXIO;
I removed that part and recompiled - but after running xdp_fwd I now have a kernel panic :)
swapper 0 [045] 68493.746274: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
swapper 0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
swapper 0 [045] 68495.794304: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
swapper 0 [045] 68496.818308: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
swapper 0 [045] 68497.842313: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
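For readers who have not looked at the xdp_fwd sample: the fib:fib_table_lookup events above come from the bpf_fib_lookup() helper, and the core forwarding pattern looks roughly like this trimmed sketch (illustrative only; it is not the code from the kernel-tables branch and it skips the VLAN handling that branch adds):

/* xdp_fwd_sketch.c - trimmed-down illustration of FIB-based forwarding in XDP */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 512);
	__type(key, __u32);
	__type(value, __u32);
} tx_port SEC(".maps");		/* userspace fills this with ifindex -> ifindex */

SEC("xdp")
int xdp_fwd_sketch(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph = data + sizeof(*eth);
	struct bpf_fib_lookup fib = {};

	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;	/* IPv4 only, no VLAN handling here */

	fib.family      = AF_INET;
	fib.tos         = iph->tos;
	fib.l4_protocol = iph->protocol;
	fib.tot_len     = bpf_ntohs(iph->tot_len);
	fib.ipv4_src    = iph->saddr;
	fib.ipv4_dst    = iph->daddr;
	fib.ifindex     = ctx->ingress_ifindex;

	/* Same FIB the stack uses; this is what shows up as fib:fib_table_lookup. */
	if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) != BPF_FIB_LKUP_RET_SUCCESS)
		return XDP_PASS;	/* punt anything we cannot forward */

	__builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
	__builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);

	return bpf_redirect_map(&tx_port, fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";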
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 03.11.2018 o 01:18, Paweł Staszewski pisze: W dniu 01.11.2018 o 21:37, Saeed Mahameed pisze: On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote: W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze: On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total = = enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s --- --- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total = = enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s --- --- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Where do you see 40 Gb/s ? you showed that both ports on the same NIC ( same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s which aligns with your pcie link limit, what am i missing ? hmm yes that was my concern also - cause cant find anywhere informations about that bandwidth is uni or bidirectional - so if 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~68Gbit - which will fit total bw on both ports i think it is bidir So yes - we are hitting there other problem i think pcie is most probabbly bidirectional max bw 126Gbit so RX 126Gbit and at same time TX should be 126Gbit So one 2-port 100G card connectx4 replaced with two separate connectx5 placed in two different pcie x16 gen 3.0 lspci -vvv -s af:00.0 af:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5] Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 90 NUMA node: 1 Region 0: Memory at 39bffe00 (64-bit, prefetchable) [size=32M] Expansion ROM at ee60 [disabled] [size=1M] Capabilities: [60] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset- MaxPayload 256 bytes, MaxReadReq 4096 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- 
BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB Ln
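As a back-of-the-envelope check on the PCIe question above (general PCIe arithmetic, not a measurement from this thread): PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding, so an x16 link carries roughly 8 GT/s * 16 lanes * 128/130 = ~126 Gbit/s in each direction, and the two directions are independent (full duplex), minus a few percent more for TLP/DLLP framing overhead.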
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 17:32, David Ahern pisze: On 11/8/18 9:27 AM, Paweł Staszewski wrote: What hardware is this? mellanox connectx 4 ethtool -i enp175s0f0 driver: mlx5_core version: 5.0-0 firmware-version: 12.21.1000 (SM_200101033) expansion-rom-version: bus-info: :af:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: no supports-priv-flags: yes ethtool -i enp175s0f1 driver: mlx5_core version: 5.0-0 firmware-version: 12.21.1000 (SM_200101033) expansion-rom-version: bus-info: :af:00.1 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: no supports-priv-flags: yes Start with: echo 1 > /sys/kernel/debug/tracing/events/xdp/enable cat /sys/kernel/debug/tracing/trace_pipe cat /sys/kernel/debug/tracing/trace_pipe -0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 FIB lookup is good, the redirect is happening, but the mlx5 driver does not like it. I think the -6 is coming from the mlx5 driver and the packet is getting dropped. Perhaps this check in mlx5e_xdp_xmit: if (unlikely(sq_num >= priv->channels.num)) return -ENXIO; Wondering about this: swapper 0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) oif 0 ? Is that correct here ? swapper 0 [045] 68493.746274: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) swapper 0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) swapper 0 [045] 68495.794304: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) swapper 0 [045] 68496.818308: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) swapper 0 [045] 68497.842313: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
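A side note on the trace above: err=-6 is -ENXIO, which is exactly what the quoted mlx5e_xdp_xmit check returns. As far as I can tell the driver picks its XDP send queue from the current CPU number, so a redirect processed on a CPU index at or above the number of configured channels (28 RSS queues bound to the local NUMA node here, while the trace runs on CPU 045) has no queue to transmit from. A minimal stand-alone illustration of that failure mode (my own sketch, not the driver code):

/* Illustration only: why xdp_devmap_xmit shows err=-6 (-ENXIO) while the
 * preceding xdp_redirect_map shows err=0.  Mirrors the quoted check
 * "if (unlikely(sq_num >= priv->channels.num)) return -ENXIO;" with the
 * SQ index assumed to come from the CPU the redirect runs on. */
#include <errno.h>
#include <stdio.h>

static int pick_xdp_sq(int cpu, int num_channels)
{
        if (cpu >= num_channels)
                return -ENXIO;          /* -6, as seen in the trace      */
        return cpu;                     /* index of the per-CPU XDP SQ   */
}

int main(void)
{
        /* 28 channels (RSS queues bound to the local NUMA node) but the
         * redirect is processed on CPU 45 -> no SQ for that CPU.        */
        printf("cpu 45, 28 channels -> %d\n", pick_xdp_sq(45, 28));  /* -6 */
        printf("cpu 10, 28 channels -> %d\n", pick_xdp_sq(10, 28));  /* 10 */
        return 0;
}

If that is what is happening, raising the channel count (ethtool -L) so it covers every CPU that can end up processing the redirect would be one quick way to confirm.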
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 17:25, Paweł Staszewski pisze: W dniu 08.11.2018 o 17:06, David Ahern pisze: On 11/8/18 6:33 AM, Paweł Staszewski wrote: W dniu 07.11.2018 o 22:06, David Ahern pisze: On 11/3/18 6:24 PM, Paweł Staszewski wrote: Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? no. just phy(mlnx)->vlans only config VLAN and non-VLAN (and a mix) seem to work ok. Patches are here: https://github.com/dsahern/linux.git bpf/kernel-tables-wip I got lazy with the vlan exports; right now it requires 8021q to be builtin (CONFIG_VLAN_8021Q=y) You can use the xdp_fwd sample: make O=kbuild -C samples/bpf -j 8 Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server and run: ./xdp_fwd e.g., in my testing I run: xdp_fwd eth1 eth2 eth3 eth4 All of the relevant forwarding ports need to be on the same command line. This version populates a second map to verify the egress port has XDP enabled. Installed today on some lab server with mellanox connectx4 And trying some simple static routing first - but after enabling xdp program - receiver is not receiving frames Route table is simple as possible for tests :) icmp ping test send from 192.168.22.237 to 172.16.0.2 - incomming packets on vlan 4081 ip r default via 192.168.22.236 dev vlan4081 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205 neigh table: ip neigh ls 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE and interfaces: 4: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 xdp/id:5 qdisc mq state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff inet 192.168.22.205/24 scope global vlan4081 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c890/64 scope link valid_lft forever preferred_lft forever 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet 172.16.0.1/30 scope global vlan1740 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever xdp program detached: Receiving side tcpdump: 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id 30227, seq 487, length 64 I can see icmp requests enabling xdp ./xdp_fwd enp175s0f1 enp175s0f0 4: enp175s0f0: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 5: enp175s0f1: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 6: vlan4081@enp175s0f0: mtu 
1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff What hardware is this? mellanox connectx 4 ethtool -i enp175s0f0 driver: mlx5_core version: 5.0-0 firmware-version: 12.21.1000 (SM_200101033) expansion-rom-version: bus-info: :af:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: no supports-priv-flags: yes ethtool -i enp175s0f1 driver: mlx5_core version: 5.0-0 firmware-version: 12.21.1000 (SM_200101033) expansion-rom-version: bus-info: :af:00.1 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: no supports-priv-flags: yes Start with: echo 1 > /sys/kernel/debug/tracing/events/xdp/enable cat /sys/kernel/debug/tracing/trace_pipe cat /sys/kernel/debug/tracing/trace_pipe -0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 -0 [045] ..s. 68470.483836: xdp_red
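For anyone reading along without the sample handy, the per-packet logic of xdp_fwd is roughly the following (a condensed sketch written against current libbpf headers; the real samples/bpf/xdp_fwd_kern.c also handles IPv6, fills in more of the fib lookup fields and consults the second map that verifies the egress port has XDP enabled):

/* Condensed sketch of the xdp_fwd forwarding path: FIB lookup in the
 * kernel routing/neighbour tables, rewrite MACs, redirect via a devmap.
 * Details differ from the real samples/bpf/xdp_fwd_kern.c. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define AF_INET 2                       /* not available in BPF includes  */

struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);           /* loader fills ifindex -> ifindex */
} xdp_tx_ports SEC(".maps");

SEC("xdp")
int xdp_fwd_sketch(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data     = (void *)(long)ctx->data;
        struct ethhdr *eth = data;
        struct iphdr *iph  = data + sizeof(*eth);
        struct bpf_fib_lookup fib = {};

        if ((void *)(iph + 1) > data_end)
                return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;        /* real sample also handles IPv6  */

        fib.family   = AF_INET;
        fib.ipv4_src = iph->saddr;
        fib.ipv4_dst = iph->daddr;
        fib.ifindex  = ctx->ingress_ifindex;

        /* Ask the kernel FIB + neighbour tables for egress port and MACs. */
        if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) !=
            BPF_FIB_LKUP_RET_SUCCESS)
                return XDP_PASS;        /* fall back to the normal stack  */

        __builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
        __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);

        /* Hand the frame to the egress device's ndo_xdp_xmit via devmap;
         * this is where the xdp_devmap_xmit tracepoint fires.            */
        return bpf_redirect_map(&xdp_tx_ports, fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";

The key point for the problem above is the last step: bpf_redirect_map() only queues the frame, the actual transmit happens later through the egress device's ndo_xdp_xmit, which is where the err=-6 shows up.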
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 17:06, David Ahern pisze: On 11/8/18 6:33 AM, Paweł Staszewski wrote: W dniu 07.11.2018 o 22:06, David Ahern pisze: On 11/3/18 6:24 PM, Paweł Staszewski wrote: Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? no. just phy(mlnx)->vlans only config VLAN and non-VLAN (and a mix) seem to work ok. Patches are here: https://github.com/dsahern/linux.git bpf/kernel-tables-wip I got lazy with the vlan exports; right now it requires 8021q to be builtin (CONFIG_VLAN_8021Q=y) You can use the xdp_fwd sample: make O=kbuild -C samples/bpf -j 8 Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server and run: ./xdp_fwd e.g., in my testing I run: xdp_fwd eth1 eth2 eth3 eth4 All of the relevant forwarding ports need to be on the same command line. This version populates a second map to verify the egress port has XDP enabled. Installed today on some lab server with mellanox connectx4 And trying some simple static routing first - but after enabling xdp program - receiver is not receiving frames Route table is simple as possible for tests :) icmp ping test send from 192.168.22.237 to 172.16.0.2 - incomming packets on vlan 4081 ip r default via 192.168.22.236 dev vlan4081 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205 neigh table: ip neigh ls 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE and interfaces: 4: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 xdp/id:5 qdisc mq state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff inet 192.168.22.205/24 scope global vlan4081 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c890/64 scope link valid_lft forever preferred_lft forever 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet 172.16.0.1/30 scope global vlan1740 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever xdp program detached: Receiving side tcpdump: 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id 30227, seq 487, length 64 I can see icmp requests enabling xdp ./xdp_fwd enp175s0f1 enp175s0f0 4: enp175s0f0: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 5: enp175s0f1: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff What hardware is this? Start with: echo 1 > /sys/kernel/debug/tracing/events/xdp/enable cat /sys/kernel/debug/tracing/trace_pipe cat /sys/kernel/debug/tracing/trace_pipe -0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 -0 [045] ..s. 68470.483836: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5 -0 [045] ..s. 68470.483837: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 -0 [045] ..s. 68471.503853: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5 -0 [045] ..s. 68471.503853: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 -0 [045] ..s. 68472.527871: xdp_redirect_map: prog_id=30 action=REDIRECT
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 01:59, Paweł Staszewski pisze: W dniu 05.11.2018 o 21:17, Jesper Dangaard Brouer pisze: On Sun, 4 Nov 2018 01:24:03 +0100 Paweł Staszewski wrote: And today again after allpy patch for page allocator - reached again 64/64 Gbit/s with only 50-60% cpu load Great. today no slowpath hit for netwoking :) But again dropped pckt at 64GbitRX and 64TX And as it should not be pcie express limit -i think something more is Well, this does sounds like a PCIe bandwidth limit to me. See the PCIe BW here: https://en.wikipedia.org/wiki/PCI_Express You likely have PCIe v3, where 1-lane have 984.6 MBytes/s or 7.87 Gbit/s Thus, x16-lanes have 15.75 GBytes or 126 Gbit/s. It does say "in each direction", but you are also forwarding this RX->TX on both (dual) ports NIC that is sharing the same PCIe slot. Network controller changed from 2-port 100G connectx4 to 2 separate cards 100G connectx5 PerfTop: 92239 irqs/sec kernel:99.4% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 6.65% [kernel] [k] irq_entries_start 5.57% [kernel] [k] tasklet_action_common.isra.21 4.60% [kernel] [k] mlx5_eq_int 4.04% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear 3.66% [kernel] [k] _raw_spin_lock_irqsave 3.58% [kernel] [k] mlx5e_sq_xmit 2.66% [kernel] [k] fib_table_lookup 2.52% [kernel] [k] _raw_spin_lock 2.51% [kernel] [k] build_skb 2.50% [kernel] [k] _raw_spin_lock_irq 2.04% [kernel] [k] try_to_wake_up 1.83% [kernel] [k] queued_spin_lock_slowpath 1.81% [kernel] [k] mlx5e_poll_tx_cq 1.65% [kernel] [k] do_idle 1.50% [kernel] [k] mlx5e_poll_rx_cq 1.34% [kernel] [k] __sched_text_start 1.32% [kernel] [k] cmd_exec 1.30% [kernel] [k] cmd_work_handler 1.16% [kernel] [k] vlan_do_receive 1.15% [kernel] [k] memcpy_erms 1.15% [kernel] [k] __dev_queue_xmit 1.07% [kernel] [k] mlx5_cmd_comp_handler 1.06% [kernel] [k] sched_ttwu_pending 1.00% [kernel] [k] ipt_do_table 0.98% [kernel] [k] ip_finish_output2 0.92% [kernel] [k] pfifo_fast_dequeue 0.88% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq 0.78% [kernel] [k] dev_gro_receive 0.78% [kernel] [k] mlx5e_napi_poll 0.76% [kernel] [k] mlx5e_post_rx_mpwqes 0.70% [kernel] [k] process_one_work 0.67% [kernel] [k] __netif_receive_skb_core 0.65% [kernel] [k] __build_skb 0.63% [kernel] [k] llist_add_batch 0.62% [kernel] [k] tcp_gro_receive 0.60% [kernel] [k] inet_gro_receive 0.59% [kernel] [k] ip_route_input_rcu 0.59% [kernel] [k] rcu_irq_exit 0.56% [kernel] [k] napi_complete_done 0.52% [kernel] [k] kmem_cache_alloc 0.48% [kernel] [k] __softirqentry_text_start 0.48% [kernel] [k] mlx5e_xmit 0.47% [kernel] [k] __queue_work 0.46% [kernel] [k] memset_erms 0.46% [kernel] [k] dev_hard_start_xmit 0.45% [kernel] [k] insert_work 0.45% [kernel] [k] enqueue_task_fair 0.44% [kernel] [k] __wake_up_common 0.43% [kernel] [k] finish_task_switch 0.43% [kernel] [k] kmem_cache_free_bulk 0.42% [kernel] [k] ip_forward 0.42% [kernel] [k] worker_thread 0.41% [kernel] [k] schedule 0.41% [kernel] [k] _raw_spin_unlock_irqrestore 0.40% [kernel] [k] netif_skb_features 0.40% [kernel] [k] queue_work_on 0.40% [kernel] [k] pfifo_fast_enqueue 0.39% [kernel] [k] vlan_dev_hard_start_xmit 0.39% [kernel] [k] page_frag_free 0.36% [kernel] [k] swiotlb_map_page 0.36% [kernel] [k] update_cfs_rq_h_load 0.35% [kernel] [k] validate_xmit_skb.isra.142 0.35% [kernel] [k] dev_ifconf 0.35% [kernel] [k] check_preempt_curr 0.34% [kernel] [k] _raw_spin_trylock 0.34% [kernel] [k] rcu_idle_exit 0.33% [kernel] [k] ip_rcv_core.isra.20.constprop.25 0.33% [kernel] [k] __qdisc_run 0.33% [kernel] [k] skb_release_data 0.32% [kernel] [k] native_sched_clock 
0.30% [kernel] [k] add_interrupt_randomness 0.29% [kernel] [k] interrupt_entry 0.28% [kernel] [k] skb_gro_receive 0.26% [kernel] [k] read_tsc 0.26% [kernel] [k] __get_xps_queue_idx 0.26% [kernel] [k] inet_gifconf 0.26% [kernel] [k] skb_segment 0.25% [ker
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 07.11.2018 o 22:06, David Ahern pisze: On 11/3/18 6:24 PM, Paweł Staszewski wrote: Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? no. just phy(mlnx)->vlans only config VLAN and non-VLAN (and a mix) seem to work ok. Patches are here: https://github.com/dsahern/linux.git bpf/kernel-tables-wip I got lazy with the vlan exports; right now it requires 8021q to be builtin (CONFIG_VLAN_8021Q=y) You can use the xdp_fwd sample: make O=kbuild -C samples/bpf -j 8 Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server and run: ./xdp_fwd e.g., in my testing I run: xdp_fwd eth1 eth2 eth3 eth4 All of the relevant forwarding ports need to be on the same command line. This version populates a second map to verify the egress port has XDP enabled. Installed today on some lab server with mellanox connectx4 And trying some simple static routing first - but after enabling xdp program - receiver is not receiving frames Route table is simple as possible for tests :) icmp ping test send from 192.168.22.237 to 172.16.0.2 - incomming packets on vlan 4081 ip r default via 192.168.22.236 dev vlan4081 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205 neigh table: ip neigh ls 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE and interfaces: 4: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 xdp/id:5 qdisc mq state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff inet 192.168.22.205/24 scope global vlan4081 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c890/64 scope link valid_lft forever preferred_lft forever 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet 172.16.0.1/30 scope global vlan1740 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever xdp program detached: Receiving side tcpdump: 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id 30227, seq 487, length 64 I can see icmp requests enabling xdp ./xdp_fwd enp175s0f1 enp175s0f0 4: enp175s0f0: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 5: enp175s0f1: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 
1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff Receiving side: no ICMP echo requests incoming on the interface. And some ethtool stats for the XDP interface that receives the ICMP requests from the sender to be forwarded: ethtool -S enp175s0f0 | grep 'rx_xdp_redirect' rx_xdp_redirect: 321 ethtool stats for the interface that should forward the ICMP requests to the receiver on vlan id 1740: ethtool -S enp175s0f1 | grep 'tx_xdp' tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 No frames tx-ed. And today, again after applying the patch for the page allocator, reached 64/64 Gbit/s with only 50-60% cpu load. You should see the cpu load drop considerably.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 05.11.2018 o 21:17, Jesper Dangaard Brouer pisze: On Sun, 4 Nov 2018 01:24:03 +0100 Paweł Staszewski wrote: And today again after allpy patch for page allocator - reached again 64/64 Gbit/s with only 50-60% cpu load Great. today no slowpath hit for netwoking :) But again dropped pckt at 64GbitRX and 64TX And as it should not be pcie express limit -i think something more is Well, this does sounds like a PCIe bandwidth limit to me. See the PCIe BW here: https://en.wikipedia.org/wiki/PCI_Express You likely have PCIe v3, where 1-lane have 984.6 MBytes/s or 7.87 Gbit/s Thus, x16-lanes have 15.75 GBytes or 126 Gbit/s. It does say "in each direction", but you are also forwarding this RX->TX on both (dual) ports NIC that is sharing the same PCIe slot. Network controller changed from 2-port 100G connectx4 to 2 separate cards 100G connectx5 PerfTop: 92239 irqs/sec kernel:99.4% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 6.65% [kernel] [k] irq_entries_start 5.57% [kernel] [k] tasklet_action_common.isra.21 4.60% [kernel] [k] mlx5_eq_int 4.04% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear 3.66% [kernel] [k] _raw_spin_lock_irqsave 3.58% [kernel] [k] mlx5e_sq_xmit 2.66% [kernel] [k] fib_table_lookup 2.52% [kernel] [k] _raw_spin_lock 2.51% [kernel] [k] build_skb 2.50% [kernel] [k] _raw_spin_lock_irq 2.04% [kernel] [k] try_to_wake_up 1.83% [kernel] [k] queued_spin_lock_slowpath 1.81% [kernel] [k] mlx5e_poll_tx_cq 1.65% [kernel] [k] do_idle 1.50% [kernel] [k] mlx5e_poll_rx_cq 1.34% [kernel] [k] __sched_text_start 1.32% [kernel] [k] cmd_exec 1.30% [kernel] [k] cmd_work_handler 1.16% [kernel] [k] vlan_do_receive 1.15% [kernel] [k] memcpy_erms 1.15% [kernel] [k] __dev_queue_xmit 1.07% [kernel] [k] mlx5_cmd_comp_handler 1.06% [kernel] [k] sched_ttwu_pending 1.00% [kernel] [k] ipt_do_table 0.98% [kernel] [k] ip_finish_output2 0.92% [kernel] [k] pfifo_fast_dequeue 0.88% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq 0.78% [kernel] [k] dev_gro_receive 0.78% [kernel] [k] mlx5e_napi_poll 0.76% [kernel] [k] mlx5e_post_rx_mpwqes 0.70% [kernel] [k] process_one_work 0.67% [kernel] [k] __netif_receive_skb_core 0.65% [kernel] [k] __build_skb 0.63% [kernel] [k] llist_add_batch 0.62% [kernel] [k] tcp_gro_receive 0.60% [kernel] [k] inet_gro_receive 0.59% [kernel] [k] ip_route_input_rcu 0.59% [kernel] [k] rcu_irq_exit 0.56% [kernel] [k] napi_complete_done 0.52% [kernel] [k] kmem_cache_alloc 0.48% [kernel] [k] __softirqentry_text_start 0.48% [kernel] [k] mlx5e_xmit 0.47% [kernel] [k] __queue_work 0.46% [kernel] [k] memset_erms 0.46% [kernel] [k] dev_hard_start_xmit 0.45% [kernel] [k] insert_work 0.45% [kernel] [k] enqueue_task_fair 0.44% [kernel] [k] __wake_up_common 0.43% [kernel] [k] finish_task_switch 0.43% [kernel] [k] kmem_cache_free_bulk 0.42% [kernel] [k] ip_forward 0.42% [kernel] [k] worker_thread 0.41% [kernel] [k] schedule 0.41% [kernel] [k] _raw_spin_unlock_irqrestore 0.40% [kernel] [k] netif_skb_features 0.40% [kernel] [k] queue_work_on 0.40% [kernel] [k] pfifo_fast_enqueue 0.39% [kernel] [k] vlan_dev_hard_start_xmit 0.39% [kernel] [k] page_frag_free 0.36% [kernel] [k] swiotlb_map_page 0.36% [kernel] [k] update_cfs_rq_h_load 0.35% [kernel] [k] validate_xmit_skb.isra.142 0.35% [kernel] [k] dev_ifconf 0.35% [kernel] [k] check_preempt_curr 0.34% [kernel] [k] _raw_spin_trylock 0.34% [kernel] [k] rcu_idle_exit 0.33% [kernel] [k] ip_rcv_core.isra.20.constprop.25 0.33% [kernel] [k] __qdisc_run 0.33% [kernel] [k] skb_release_data 0.32% [kernel] [k] native_sched_clock 0.30% [kernel] [k] add_interrupt_randomness 0.29% 
[kernel] [k] interrupt_entry 0.28% [kernel] [k] skb_gro_receive 0.26% [kernel] [k] read_tsc 0.26% [kernel] [k] __get_xps_queue_idx 0.26% [kernel] [k] inet_gifconf 0.26% [kernel] [k] skb_segment 0.25% [kernel] [k] __tasklet_schedule_common 0
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 03.11.2018 at 18:32, David Ahern wrote: On 11/1/18 11:30 AM, Paweł Staszewski wrote: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c I can try some tests on the same hw but in a testlab configuration - will give it a try :) That version does not work with VLANs. I have patches for it but it needs a bit more work before sending out. Perhaps I can get back to it next week. Would be nice - next week I will be able to replace the network controller and install two separate 100Gbit NICs into two PCIe x16 slots - so I can test without hitting PCIe bandwidth limits. Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? No, just a phy(mlnx)->vlans only config. And today, again after applying the patch for the page allocator, reached 64/64 Gbit/s with only 50-60% cpu load. Today no slowpath hit for networking :) But again dropped packets at 64 Gbit RX and 64 Gbit TX. And as it should not be a PCIe limit, I think something more is going on there - and it is hard to catch, because perf top doesn't change, apart from the queued slowpath hit being gone now. I also ordered Intel cards to compare - but 3 weeks ETA. Faster - in 3 days - I will have Mellanox ConnectX-5, so I can separate the traffic onto two different x16 PCIe buses.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 03.11.2018 o 16:23, Paweł Staszewski pisze: W dniu 03.11.2018 o 13:58, Jesper Dangaard Brouer pisze: On Sat, 3 Nov 2018 01:16:08 +0100 Paweł Staszewski wrote: W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze: W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... [...] TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: [...] I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight Patch applied perf report: https://ufile.io/sytfh But i need to wait also with more traffic currently cpu's are sleeping Well, that would be the expected result, that the CPUs get more time to sleep, if the lock contention is gone... What is the measured bandwidth now? 30 RX /30 TX Gbit/s Notice, you might still be limited by the PCIe bandwidth, but then your CPUs might actually decide to sleep, as they are getting data fast enough. Yes - i will replace network controller to two separate nic's in two separate x16 pcie But after monday. But i dont think i hit pcie limit there - it looks like pcie x16 gen3 have 16GB/s RX and 16GB/s TX so bidirectional Was thinking that maybee memory limit - but also there is 4 channel DDR4 2666MHz - so total bandwidth for memory is bigger (48GB/s) than needed for 100Gbit ethernet [...] diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e2ef1c17942f..65c0ae13215a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); - if (unlikely(put_page_testzero(page))) - __free_pages_ok(page, compound_order(page)); + if (unlikely(put_page_testzero(page))) { + unsigned int order = compound_order(page); + + if (order == 0) + free_unref_page(page); + else + __free_pages_ok(page, order); + }
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 03.11.2018 o 13:58, Jesper Dangaard Brouer pisze: On Sat, 3 Nov 2018 01:16:08 +0100 Paweł Staszewski wrote: W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze: W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... [...] TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: [...] I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight Patch applied perf report: https://ufile.io/sytfh But i need to wait also with more traffic currently cpu's are sleeping Well, that would be the expected result, that the CPUs get more time to sleep, if the lock contention is gone... What is the measured bandwidth now? 30 RX /30 TX Gbit/s Notice, you might still be limited by the PCIe bandwidth, but then your CPUs might actually decide to sleep, as they are getting data fast enough. Yes - i will replace network controller to two separate nic's in two separate x16 pcie But after monday. But i dont think i hit pcie limit there - it looks like pcie x16 gen3 have 16GB/s RX and 16GB/s TX so bidirectional [...] diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e2ef1c17942f..65c0ae13215a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); - if (unlikely(put_page_testzero(page))) - __free_pages_ok(page, compound_order(page)); + if (unlikely(put_page_testzero(page))) { + unsigned int order = compound_order(page); + + if (order == 0) + free_unref_page(page); + else + __free_pages_ok(page, order); + }
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 03.11.2018 o 01:16, Paweł Staszewski pisze: W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze: W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... Section copied out: mlx5e_poll_tx_cq | --16.34%--napi_consume_skb | |--12.65%--__free_pages_ok | | | --11.86%--free_one_page | | | |--10.10% --queued_spin_lock_slowpath | | | --0.65%--_raw_spin_lock This callchain looks like it is freeing higher order pages than order 0: __free_pages_ok is only called for pages whose order are bigger than 0. mlx5 rx uses only order 0 pages, so i don't know where these high order tx SKBs are coming from.. Perhaps here: __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and __napi_alloc_frag() will all call page_frag_alloc(), which will use __page_frag_cache_refill() to get an order 3 page if possible, or fall back to an order 0 page if order 3 page is not available. I'm not sure if your workload will use the above code path though. TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: static void skb_free_head(struct sk_buff *skb) { unsigned char *head = skb->head; if (skb->head_frag) skb_free_frag(head); else kfree(head); } static inline void skb_free_frag(void *addr) { page_frag_free(addr); } /* * Frees a page fragment allocated out of either a compound or order 0 page. */ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); if (unlikely(put_page_testzero(page))) __free_pages_ok(page, compound_order(page)); } EXPORT_SYMBOL(page_frag_free); I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight Patch applied perf report: https://ufile.io/sytfh But i need to wait also with more traffic currently cpu's are sleeping before patch: | | | | |--13.55%--mlx5e_poll_tx_cq | | | | | | | | | | | --10.32%--napi_consume_skb | | | | | | | | | | | |--8.52%--__free_pages_ok | | | | | | | | | | | | | --7.67%--free_one_page | | | | | | | | | | | | | |--6.05%--queued_spin_lock_slowpath | | | | | | | | | | | | | --0.64%--_raw_spin_lock | | | | | | | | | | | |--0.77%--skb_release_data | | | | | | | | | | | --0.72%--page_frag_free after patch: | | | | | |--3.75%--mlx5e_poll_tx_cq | | | | | | | | | | | | | --1.53%--napi_consume_skb | | | | | | | | | | | | | --0.54%--skb_release_data | | | | | | | | | | | --3.09
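To make the "bypassing per-cpu-pages" point a bit more concrete, here is a tiny userspace model (my own sketch, not mm code) of the difference between the two free paths: the per-cpu-pages path only takes the shared lock once per batch, while the direct __free_pages_ok() -> free_one_page() path takes zone->lock for every single page, which is what shows up as queued_spin_lock_slowpath under napi_consume_skb in the profiles above.

/* Userspace model only (not kernel code): contrast the per-cpu-pages
 * free path (free_unref_page) with the direct buddy path
 * (__free_pages_ok -> free_one_page).  The shared mutex stands in for
 * zone->lock; the point is how often it has to be taken per freed page. */
#include <pthread.h>
#include <stdio.h>

#define PCP_BATCH 31        /* assumed batch size, for illustration only */

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long lock_acquisitions;

static void free_to_buddy(int nr_pages)     /* stands in for free_one_page */
{
        pthread_mutex_lock(&zone_lock);
        lock_acquisitions++;
        pthread_mutex_unlock(&zone_lock);
        (void)nr_pages;
}

static __thread int pcp_count;              /* per-CPU cached pages        */

static void free_unref_page_model(void)     /* per-cpu-pages path          */
{
        if (++pcp_count < PCP_BATCH)
                return;                     /* no shared lock taken        */
        free_to_buddy(pcp_count);           /* one acquisition per batch   */
        pcp_count = 0;
}

static void free_pages_ok_model(void)       /* direct buddy path           */
{
        free_to_buddy(1);                   /* lock taken for every page   */
}

int main(void)
{
        int i;

        for (i = 0; i < 1000000; i++)
                free_unref_page_model();
        printf("per-cpu path: %lu lock acquisitions / 1M frees\n",
               lock_acquisitions);

        lock_acquisitions = 0;
        for (i = 0; i < 1000000; i++)
                free_pages_ok_model();
        printf("direct path : %lu lock acquisitions / 1M frees\n",
               lock_acquisitions);
        return 0;
}

With dozens of CPUs each freeing order-0 skb heads at high packet rates, that difference in shared-lock traffic adds up quickly.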
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 21:37, Saeed Mahameed pisze: On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote: W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze: On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx TxTotal = = enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s --- --- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx TxTotal = = enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s --- --- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Where do you see 40 Gb/s ? you showed that both ports on the same NIC ( same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s which aligns with your pcie link limit, what am i missing ? hmm yes that was my concern also - cause cant find anywhere informations about that bandwidth is uni or bidirectional - so if 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~68Gbit - which will fit total bw on both ports i think it is bidir So yes - we are hitting there other problem i think pcie is most probabbly bidirectional max bw 126Gbit so RX 126Gbit and at same time TX should be 126Gbit This can explain maybee also why cpuload is rising rapidly from 120Gbit/s in total to 132Gbit (counters of bwmng are from /proc/net - so there can be some error in reading them when offloading (gro/gso/tso) on nic's is enabled that is why Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) Are you forwarding when using pktgen as well or you just testing the RX side pps ? Yes pktgen was tested on single port RX Can check also forwarding to eliminate pciex limits So this explains why you have more RX pps, since tx is idle and pcie will be free to do only rx. [...] 
ethtool -S enp175s0f1 NIC statistics: rx_packets: 173730800927 rx_bytes: 99827422751332 tx_packets: 142532009512 tx_bytes: 184633045911222 tx_tso_packets: 25989113891 tx_tso_bytes: 132933363384458 tx_tso_inner_packets: 0 tx_tso_inner_bytes: 0 tx_added_vlan_packets: 74630239613 tx_nop: 2029817748 rx_lro_packets: 0 rx_lro_bytes: 0 rx_ecn_mark: 0 rx_removed_vlan_packets: 173730800927 rx_csum_unnecessary: 0 rx_csum_none: 434357 rx_csum_complete: 173730366570 rx_csum_unnecessary_inner: 0 rx_xdp_drop: 0 rx_xdp_redirect: 0 rx_xdp_tx_xmit: 0 rx_xdp_tx_full: 0 rx_xdp_tx_err: 0 rx_xdp_tx_cqe: 0 tx_csum_none: 38260960853 tx_csum_partial: 36369278774 tx_csum_partial_inner: 0 tx_queue_stopped: 1 tx_queue_dropped: 0 tx_xmit_more: 748638099 tx_recover: 0 tx_cqes: 73881645031 tx_queue_wake: 1 tx_udp_seg_rem: 0 tx_cqe_err: 0 tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 rx_wqe_err: 0 rx_mpwqe_filler_cqes: 0 rx_mpwqe_filler_strides: 0 rx_buff_alloc_err: 0 rx_cqe_compress_blks: 0 rx_cqe_compress_pkts: 0 If this is a pcie bottleneck it might be useful to enable CQE compression (to reduce PCIe completion descriptors transactions) you s
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze: W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... Section copied out: mlx5e_poll_tx_cq | --16.34%--napi_consume_skb | |--12.65%--__free_pages_ok | | | --11.86%--free_one_page | | | |--10.10% --queued_spin_lock_slowpath | | | --0.65%--_raw_spin_lock This callchain looks like it is freeing higher order pages than order 0: __free_pages_ok is only called for pages whose order are bigger than 0. mlx5 rx uses only order 0 pages, so i don't know where these high order tx SKBs are coming from.. Perhaps here: __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and __napi_alloc_frag() will all call page_frag_alloc(), which will use __page_frag_cache_refill() to get an order 3 page if possible, or fall back to an order 0 page if order 3 page is not available. I'm not sure if your workload will use the above code path though. TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: static void skb_free_head(struct sk_buff *skb) { unsigned char *head = skb->head; if (skb->head_frag) skb_free_frag(head); else kfree(head); } static inline void skb_free_frag(void *addr) { page_frag_free(addr); } /* * Frees a page fragment allocated out of either a compound or order 0 page. */ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); if (unlikely(put_page_testzero(page))) __free_pages_ok(page, compound_order(page)); } EXPORT_SYMBOL(page_frag_free); I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight Patch applied perf report: https://ufile.io/sytfh But i need to wait also with more traffic currently cpu's are sleeping diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e2ef1c17942f..65c0ae13215a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); - if (unlikely(put_page_testzero(page))) - __free_pages_ok(page, compound_order(page)); + if (unlikely(put_page_testzero(page))) { + unsigned int order = compound_order(page); + + if (order == 0) + free_unref_page(page); + else + __free_pages_ok(page, order); + } } EXPORT_SYMBOL(page_frag_free); Notice for the mlx5 driver it support several RX-memory models, so it can be hard to follow, but from the perf report output we can see that is uses mlx5e_skb_from_cqe_linear, which use build_skb. 
--13.63%--mlx5e_skb_from_cqe_linear | --5.02%--build_skb | --1.85%--__build_skb | --1.00%--kmem_cache_alloc /* build_skb() is wrapper over __build_skb(), that specifically * takes care of skb->head and skb->pfmemalloc * This means that if @frag_size is not zero, then @data must be backed * by a page fragment, not kmalloc() or vmalloc() */ struct sk_buff *build_skb(void *data, unsigned int frag_size) { struct sk_buff *skb = __build_skb(data, frag_size); if (skb && frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; } return skb; } EXPORT_SYMBOL(build_skb); It still doesn't prove, that the @data is backed by by a order-0 page. For the mlx5 driver is uses mlx5e_page_alloc_mapped -> page_pool_dev_alloc_pages(), and I can see perf report using __page_pool_alloc_pages_slow(). The setup for page_pool in mlx5 uses order=0. /* Create a page_pool and register it with rxq */ pp_params.order = 0; pp_params.flags = 0; /* No-internal DMA mapping in page_pool */ pp_params.pool_size = pool_size; pp_params.nid = cpu_to_node(c->cpu); pp_params.dev = c->pdev; pp_params.dma_dir = rq->buff.map_dir; /* page_pool can be used even when there is no rq->xdp_prog, * given page_pool does not handle DMA mapping there is no * required state to clear. And page_pool gracefully handle * elevated refcnt. */ rq->page_p
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... Section copied out: mlx5e_poll_tx_cq | --16.34%--napi_consume_skb | |--12.65%--__free_pages_ok | | | --11.86%--free_one_page | | | |--10.10% --queued_spin_lock_slowpath | | | --0.65%--_raw_spin_lock This callchain looks like it is freeing higher order pages than order 0: __free_pages_ok is only called for pages whose order are bigger than 0. mlx5 rx uses only order 0 pages, so i don't know where these high order tx SKBs are coming from.. Perhaps here: __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and __napi_alloc_frag() will all call page_frag_alloc(), which will use __page_frag_cache_refill() to get an order 3 page if possible, or fall back to an order 0 page if order 3 page is not available. I'm not sure if your workload will use the above code path though. TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: static void skb_free_head(struct sk_buff *skb) { unsigned char *head = skb->head; if (skb->head_frag) skb_free_frag(head); else kfree(head); } static inline void skb_free_frag(void *addr) { page_frag_free(addr); } /* * Frees a page fragment allocated out of either a compound or order 0 page. */ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); if (unlikely(put_page_testzero(page))) __free_pages_ok(page, compound_order(page)); } EXPORT_SYMBOL(page_frag_free); I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e2ef1c17942f..65c0ae13215a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); - if (unlikely(put_page_testzero(page))) - __free_pages_ok(page, compound_order(page)); + if (unlikely(put_page_testzero(page))) { + unsigned int order = compound_order(page); + + if (order == 0) + free_unref_page(page); + else + __free_pages_ok(page, order); + } } EXPORT_SYMBOL(page_frag_free); Notice for the mlx5 driver it support several RX-memory models, so it can be hard to follow, but from the perf report output we can see that is uses mlx5e_skb_from_cqe_linear, which use build_skb. --13.63%--mlx5e_skb_from_cqe_linear | --5.02%--build_skb | --1.85%--__build_skb | --1.00%--kmem_cache_alloc /* build_skb() is wrapper over __build_skb(), that specifically * takes care of skb->head and skb->pfmemalloc * This means that if @frag_size is not zero, then @data must be backed * by a page fragment, not kmalloc() or vmalloc() */ struct sk_buff *build_skb(void *data, unsigned int frag_size) { struct sk_buff *skb = __build_skb(data, frag_size); if (skb && frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; } return skb; } EXPORT_SYMBOL(build_skb); It still doesn't prove, that the @data is backed by by a order-0 page. 
For the mlx5 driver is uses mlx5e_page_alloc_mapped -> page_pool_dev_alloc_pages(), and I can see perf report using __page_pool_alloc_pages_slow(). The setup for page_pool in mlx5 uses order=0. /* Create a page_pool and register it with rxq */ pp_params.order = 0; pp_params.flags = 0; /* No-internal DMA mapping in page_pool */ pp_params.pool_size = pool_size; pp_params.nid = cpu_to_node(c->cpu); pp_params.dev = c->pdev; pp_params.dma_dir = rq->buff.map_dir; /* page_pool can be used even when there is no rq->xdp_prog, * given page_pool does not handle DMA mapping there is no * required state to clear. And page_pool gracefully handle * elevated refcnt. */ rq->page_pool = page_pool_create(&pp_params
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 22:24, Paweł Staszewski pisze: W dniu 01.11.2018 o 22:18, Paweł Staszewski pisze: W dniu 01.11.2018 o 21:37, Saeed Mahameed pisze: On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote: W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze: On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total = = enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s --- --- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total = = enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s --- --- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Where do you see 40 Gb/s ? you showed that both ports on the same NIC ( same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s which aligns with your pcie link limit, what am i missing ? hmm yes that was my concern also - cause cant find anywhere informations about that bandwidth is uni or bidirectional - so if 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~68Gbit - which will fit total bw on both ports i think it is bidir This can explain maybee also why cpuload is rising rapidly from 120Gbit/s in total to 132Gbit (counters of bwmng are from /proc/net - so there can be some error in reading them when offloading (gro/gso/tso) on nic's is enabled that is why Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) Are you forwarding when using pktgen as well or you just testing the RX side pps ? Yes pktgen was tested on single port RX Can check also forwarding to eliminate pciex limits So this explains why you have more RX pps, since tx is idle and pcie will be free to do only rx. [...] 
ethtool -S enp175s0f1 NIC statistics: rx_packets: 173730800927 rx_bytes: 99827422751332 tx_packets: 142532009512 tx_bytes: 184633045911222 tx_tso_packets: 25989113891 tx_tso_bytes: 132933363384458 tx_tso_inner_packets: 0 tx_tso_inner_bytes: 0 tx_added_vlan_packets: 74630239613 tx_nop: 2029817748 rx_lro_packets: 0 rx_lro_bytes: 0 rx_ecn_mark: 0 rx_removed_vlan_packets: 173730800927 rx_csum_unnecessary: 0 rx_csum_none: 434357 rx_csum_complete: 173730366570 rx_csum_unnecessary_inner: 0 rx_xdp_drop: 0 rx_xdp_redirect: 0 rx_xdp_tx_xmit: 0 rx_xdp_tx_full: 0 rx_xdp_tx_err: 0 rx_xdp_tx_cqe: 0 tx_csum_none: 38260960853 tx_csum_partial: 36369278774 tx_csum_partial_inner: 0 tx_queue_stopped: 1 tx_queue_dropped: 0 tx_xmit_more: 748638099 tx_recover: 0 tx_cqes: 73881645031 tx_queue_wake: 1 tx_udp_seg_rem: 0 tx_cqe_err: 0 tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 rx_wqe_err: 0 rx_mpwqe_filler_cqes: 0 rx_mpwqe_filler_strides: 0 rx_buff_alloc_err: 0 rx_cqe_compress_blks: 0 rx_cqe_compress_pkts: 0 If this is a pcie bottleneck it might be useful to enable CQE compression (to reduce PCIe completion descriptors transactions) you should see the above rx_cqe_compress_pkts increasing when enabled. $ ethtool --set-priv-flags enp175s0f1
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 22:18, Paweł Staszewski pisze: W dniu 01.11.2018 o 21:37, Saeed Mahameed pisze: On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote: W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze: On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total = = enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s --- --- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total = = enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s --- --- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Where do you see 40 Gb/s ? you showed that both ports on the same NIC ( same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s which aligns with your pcie link limit, what am i missing ? hmm yes that was my concern also - cause cant find anywhere informations about that bandwidth is uni or bidirectional - so if 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~68Gbit - which will fit total bw on both ports i think it is bidir This can explain maybee also why cpuload is rising rapidly from 120Gbit/s in total to 132Gbit (counters of bwmng are from /proc/net - so there can be some error in reading them when offloading (gro/gso/tso) on nic's is enabled that is why Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) Are you forwarding when using pktgen as well or you just testing the RX side pps ? Yes pktgen was tested on single port RX Can check also forwarding to eliminate pciex limits So this explains why you have more RX pps, since tx is idle and pcie will be free to do only rx. [...] 
ethtool -S enp175s0f1 NIC statistics: rx_packets: 173730800927 rx_bytes: 99827422751332 tx_packets: 142532009512 tx_bytes: 184633045911222 tx_tso_packets: 25989113891 tx_tso_bytes: 132933363384458 tx_tso_inner_packets: 0 tx_tso_inner_bytes: 0 tx_added_vlan_packets: 74630239613 tx_nop: 2029817748 rx_lro_packets: 0 rx_lro_bytes: 0 rx_ecn_mark: 0 rx_removed_vlan_packets: 173730800927 rx_csum_unnecessary: 0 rx_csum_none: 434357 rx_csum_complete: 173730366570 rx_csum_unnecessary_inner: 0 rx_xdp_drop: 0 rx_xdp_redirect: 0 rx_xdp_tx_xmit: 0 rx_xdp_tx_full: 0 rx_xdp_tx_err: 0 rx_xdp_tx_cqe: 0 tx_csum_none: 38260960853 tx_csum_partial: 36369278774 tx_csum_partial_inner: 0 tx_queue_stopped: 1 tx_queue_dropped: 0 tx_xmit_more: 748638099 tx_recover: 0 tx_cqes: 73881645031 tx_queue_wake: 1 tx_udp_seg_rem: 0 tx_cqe_err: 0 tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 rx_wqe_err: 0 rx_mpwqe_filler_cqes: 0 rx_mpwqe_filler_strides: 0 rx_buff_alloc_err: 0 rx_cqe_compress_blks: 0 rx_cqe_compress_pkts: 0 If this is a pcie bottleneck it might be useful to enable CQE compression (to reduce PCIe completion descriptors transactions) you should see the above rx_cqe_compress_pkts increasing when enabled. $ ethtool --set-priv-flags enp175s0f1 rx_cqe
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 01.11.2018 at 18:23, David Ahern wrote:
On 11/1/18 7:52 AM, Paweł Staszewski wrote:
On 01.11.2018 at 11:55, Jesper Dangaard Brouer wrote:
On Wed, 31 Oct 2018 21:37:16 -0600 David Ahern wrote:

This is mainly a forwarding use case? Seems so based on the perf report. I suspect forwarding with XDP would show a pretty good improvement.

Yes, significant performance improvements. Notice David's talk: "Leveraging Kernel Tables with XDP" http://vger.kernel.org/lpc-networking2018.html#session-1

It will be really interesting.

It's pushing the exact use case you have: FRR manages the FIB, XDP programs get access to updates as they happen for fast-path forwarding.

Can't wait then :)

It looks like you are doing "pure" IP routing, without any iptables conntrack stuff (judging from your perf report data). That will actually be a really good use case for accelerating this with XDP.

Yes, pure IP routing; iptables is used only for some local input filtering.

I want you to understand the philosophy behind how David and I want people to leverage XDP. Think of XDP as a software offload layer for the kernel network stack. Set up and use the Linux kernel network stack, but accelerate parts of it with XDP, e.g. the route FIB lookup. Sample code available here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c

I can try some tests on the same hardware but with a testlab configuration - will give it a try :)

That version does not work with VLANs. I have patches for it but it needs a bit more work before sending out. Perhaps I can get back to it next week.

That will be nice - next week I will be able to replace the network controller and install two separate 100Gbit NICs into two PCIe x16 slots, so I can test without hitting PCIe bandwidth limits.
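For anyone who wants to try that sample, the rough workflow is sketched below; the build target and the attach syntax are assumptions that differ between kernel versions, so treat it as illustrative rather than exact:

  # Build the BPF samples in the kernel tree, then attach the FIB-based forwarder
  cd samples/bpf && make
  ./xdp_fwd enp175s0f0 enp175s0f1   # forwarding decisions come from the kernel FIB that FRR populates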
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 12:09, Paweł Staszewski pisze: rx_cqe_compress_pkts: 0 If this is a pcie bottleneck it might be useful to enable CQE compression (to reduce PCIe completion descriptors transactions) you should see the above rx_cqe_compress_pkts increasing when enabled. $ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on $ ethtool --show-priv-flags enp175s0f1 Private flags for p6p1: rx_cqe_moder : on cqe_moder : off rx_cqe_compress : on ... try this on both interfaces. Done ethtool --show-priv-flags enp175s0f1 Private flags for enp175s0f1: rx_cqe_moder : on tx_cqe_moder : off rx_cqe_compress : on rx_striding_rq : off rx_no_csum_complete: off ethtool --show-priv-flags enp175s0f0 Private flags for enp175s0f0: rx_cqe_moder : on tx_cqe_moder : off rx_cqe_compress : on rx_striding_rq : off rx_no_csum_complete: off Enabling cqe compress changes nothing after reaching 64Gbit RX / 64Gbit/s TX on interfaces cpu's are saturated at 100% ethtool -S enp175s0f1 | grep rx_cqe_compress rx_cqe_compress_blks: 5657836379 rx_cqe_compress_pkts: 13153761080 ethtool -S enp175s0f0 | grep rx_cqe_compress rx_cqe_compress_blks: 5994612500 rx_cqe_compress_pkts: 13579014869 bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 27.03 Gb/s 37.09 Gb/s 64.12 Gb/s enp175s0f0: 36.84 Gb/s 26.82 Gb/s 63.66 Gb/s -- total: 63.85 Gb/s 63.87 Gb/s 127.72 Gb/s bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate / iface Rx Tx Total == enp175s0f1: 3.22 GB/s 4.26 GB/s 7.48 GB/s enp175s0f0: 4.24 GB/s 3.21 GB/s 7.45 GB/s -- total: 7.46 GB/s 7.47 GB/s 14.93 GB/s mpstat Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: all 0.05 0.00 0.19 0.02 0.00 42.74 0.00 0.00 0.00 56.99 Average: 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 1 0.00 0.00 0.30 0.00 0.00 0.00 0.00 0.00 0.00 99.70 Average: 2 0.00 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 99.80 Average: 3 0.00 0.00 0.20 1.20 0.00 0.00 0.00 0.00 0.00 98.60 Average: 4 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.90 Average: 5 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 99.90 Average: 6 0.10 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 99.70 Average: 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 12 1.40 0.00 4.50 0.00 0.00 0.00 0.00 0.00 0.00 94.10 Average: 13 0.00 0.00 1.60 0.00 0.00 0.00 0.00 0.00 0.00 98.40 Average: 14 0.00 0.00 0.00 0.00 0.00 84.10 0.00 0.00 0.00 15.90 Average: 15 0.00 0.00 0.10 0.00 0.00 93.70 0.00 0.00 0.00 6.20 Average: 16 0.00 0.00 0.10 0.00 0.00 94.31 0.00 0.00 0.00 5.59 Average: 17 0.00 0.00 0.00 0.00 0.00 95.30 0.00 0.00 0.00 4.70 Average: 18 0.00 0.00 0.00 0.00 0.00 62.80 0.00 0.00 0.00 37.20 Average: 19 0.00 0.00 0.10 0.00 0.00 98.90 0.00 0.00 0.00 1.00 Average: 20 0.00 0.00 0.00 0.00 0.00 99.30 0.00 0.00 0.00 0.70 Average: 21 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 Average: 22 0.00 0.00 0.00 0.00 0.00 99.90 0.00 0.00 0.00 0.10 Average: 23 0.00 0.00 0.10 0.00 0.00 99.90 0.00 0.00 0.00 0.00 Average: 24 0.00 0.00 0.10 0.00 0.00 97.10 0.00 0.00 0.00 2.80 Average: 2
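A quick way to check whether the RSS spread really matches the saturated cores is to watch the per-channel counters; the rxN_packets naming below is the usual mlx5 convention, so adjust it if this driver version reports different names:

  # Per-queue packet counters; if only some of them grow, the hash is not spreading flows evenly
  watch -d -n1 "ethtool -S enp175s0f1 | grep -E '^ *rx[0-9]+_packets'"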
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 01.11.2018 at 11:55, Jesper Dangaard Brouer wrote:
On Wed, 31 Oct 2018 21:37:16 -0600 David Ahern wrote:

This is mainly a forwarding use case? Seems so based on the perf report. I suspect forwarding with XDP would show a pretty good improvement.

Yes, significant performance improvements. Notice David's talk: "Leveraging Kernel Tables with XDP" http://vger.kernel.org/lpc-networking2018.html#session-1

It will be really interesting.

It looks like you are doing "pure" IP routing, without any iptables conntrack stuff (judging from your perf report data). That will actually be a really good use case for accelerating this with XDP.

Yes, pure IP routing; iptables is used only for some local input filtering.

I want you to understand the philosophy behind how David and I want people to leverage XDP. Think of XDP as a software offload layer for the kernel network stack. Set up and use the Linux kernel network stack, but accelerate parts of it with XDP, e.g. the route FIB lookup. Sample code available here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c

I can try some tests on the same hardware but with a testlab configuration - will give it a try :)

(I do warn that we just found a bug/crash in setup+teardown for the mlx5 driver you are using, which we/Mellanox _will_ fix soon.)

Ok

You need the vlan changes I have queued up though. I know Yoel will be very interested in those changes too! I've convinced Yoel to write an XDP program for his Border Network Gateway (BNG) production system[1], and he is a heavy VLAN user. And the plan is to open-source this when he has something working.

[1] https://www.version2.dk/blog/software-router-del-5-linux-bng-1086060

Ok - for now I need to split the traffic onto two separate 100G ports placed in two different PCIe x16 slots, to check whether the problem is mainly caused by running out of PCIe x16 bandwidth.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 10:22, Jesper Dangaard Brouer pisze: On Wed, 31 Oct 2018 23:20:01 +0100 Paweł Staszewski wrote: W dniu 31.10.2018 o 23:09, Eric Dumazet pisze: On 10/31/2018 02:57 PM, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Pawel is this live production traffic? Yes moved server from testlab to production to check (risking a little - but this is traffic switched to backup router : ) ) I know Yoel (Cc) is very interested to know the real-life limitation of Linux as a router, especially with VLANs like you use. So yes this is real-life traffic , real users - normal mixed internet traffic forwarded (including ddos-es :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total == enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s -- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Actually rather impressive number for a Linux router. Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s -- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s Average packet size: (28.51*10^9/8)/5248589 = 678.99 bytes (38.07*10^9/8)/3557944 = 1337.49 bytes After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) And wondering if there is something that can be improved here. Some more informations / counters / stats and perf top below: Perf top flame graph: https://uploadfiles.io/7zo6u Thanks a lot for the flame graph! System configuration(long): cat /sys/devices/system/node/node1/cpulist 14-27,42-55 cat /sys/class/net/enp175s0f0/device/numa_node 1 cat /sys/class/net/enp175s0f1/device/numa_node 1 Hint grep can give you nicer output that cat: $ grep -H . /sys/class/net/*/device/numa_node Sure: grep -H . 
/sys/class/net/*/device/numa_node /sys/class/net/enp175s0f0/device/numa_node:1 /sys/class/net/enp175s0f1/device/numa_node:1 ip -s -d link ls dev enp175s0f0 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 184142375840858 141347715974 2 2806325 0 85050528 TX: bytes packets errors dropped carrier collsns 99270697277430 172227994003 0 0 0 0 ip -s -d link ls dev enp175s0f1 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 99686284170801 173507590134 61 669685 0 100304421 TX: bytes packets errors dropped carrier collsns 184435107970545 142383178304 0 0 0 0 You have increased the default (1000) qlen to 8192, why? Was checking if higher txq will change anything But no change for settings 1000,4096,8192 But yes i do not use there any traffic shaping like hfsc/hdb etc - just default qdisc mq 0: root pfifp_fast tc qdisc show dev enp175s0f1 qdisc mq 0: root qdisc pfifo_fast 0: parent :38 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo
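Since the RX side above shows a non-zero dropped counter, it is worth seeing which driver counter is actually incrementing; a rough filter follows (counter names vary with driver and firmware, so the pattern is deliberately broad):

  # Show only the drop/discard/error counters that are non-zero
  ethtool -S enp175s0f0 | grep -iE 'drop|discard|out_of_buffer|err' | grep -v ': 0$'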
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 31.10.2018 o 23:20, Paweł Staszewski pisze: W dniu 31.10.2018 o 23:09, Eric Dumazet pisze: On 10/31/2018 02:57 PM, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total == enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s -- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s -- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) And wondering if there is something that can be improved here. Some more informations / counters / stats and perf top below: Perf top flame graph: https://uploadfiles.io/7zo6u System configuration(long): cat /sys/devices/system/node/node1/cpulist 14-27,42-55 cat /sys/class/net/enp175s0f0/device/numa_node 1 cat /sys/class/net/enp175s0f1/device/numa_node 1 ip -s -d link ls dev enp175s0f0 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 184142375840858 141347715974 2 2806325 0 85050528 TX: bytes packets errors dropped carrier collsns 99270697277430 172227994003 0 0 0 0 ip -s -d link ls dev enp175s0f1 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 99686284170801 173507590134 61 669685 0 100304421 TX: bytes packets errors dropped carrier collsns 184435107970545 142383178304 0 0 0 0 ./softnet.sh cpu total dropped squeezed collision rps flow_limit PerfTop: 108490 irqs/sec kernel:99.6% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 26.78% [kernel] [k] queued_spin_lock_slowpath This is highly suspect. A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what is going on. perf report: https://ufile.io/rqp0h With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 1/1 RX/TX mapping, so I do not know what could request a spinlock contention. And yes there is no RPF/RFS - just 1/1 RX/TX and affinity mapping on local cpu for the network controller for 28 RX+TX queues per nic .
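The 1:1 queue-to-CPU mapping mentioned above is normally achieved by pinning each completion IRQ to one NUMA-local core; a rough sketch follows, where the mlx5_comp match on /proc/interrupts is an assumption -- check how the vectors are actually named on this box:

  # Pin mlx5 completion vectors round-robin onto node 1 cores (14-27,42-55)
  cores=($(seq 14 27) $(seq 42 55)); i=0
  for irq in $(awk -F: '/mlx5_comp/ {print $1}' /proc/interrupts); do
      echo ${cores[$((i % ${#cores[@]}))]} > /proc/irq/$irq/smp_affinity_list
      i=$((i + 1))
  done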
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 31.10.2018 o 23:09, Eric Dumazet pisze: On 10/31/2018 02:57 PM, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total == enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s -- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s -- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) And wondering if there is something that can be improved here. Some more informations / counters / stats and perf top below: Perf top flame graph: https://uploadfiles.io/7zo6u System configuration(long): cat /sys/devices/system/node/node1/cpulist 14-27,42-55 cat /sys/class/net/enp175s0f0/device/numa_node 1 cat /sys/class/net/enp175s0f1/device/numa_node 1 ip -s -d link ls dev enp175s0f0 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 184142375840858 141347715974 2 2806325 0 85050528 TX: bytes packets errors dropped carrier collsns 99270697277430 172227994003 0 0 0 0 ip -s -d link ls dev enp175s0f1 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 99686284170801 173507590134 61 669685 0 100304421 TX: bytes packets errors dropped carrier collsns 184435107970545 142383178304 0 0 0 0 ./softnet.sh cpu total dropped squeezed collision rps flow_limit PerfTop: 108490 irqs/sec kernel:99.6% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 26.78% [kernel] [k] queued_spin_lock_slowpath This is highly suspect. A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what is going on. perf report: https://ufile.io/rqp0h With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 1/1 RX/TX mapping, so I do not know what could request a spinlock contention.
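The softnet.sh referenced above presumably parses /proc/net/softnet_stat; a minimal stand-in that prints the same per-CPU counters (assuming GNU awk for strtonum) looks like this:

  # Columns of /proc/net/softnet_stat are hex: packets processed, dropped, time_squeeze
  awk '{ printf "cpu%-3d processed=%d dropped=%d squeezed=%d\n",
         NR-1, strtonum("0x" $1), strtonum("0x" $2), strtonum("0x" $3) }' /proc/net/softnet_stat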
Re: Latest net-next kernel 4.19.0+
On 30.10.2018 at 15:16, Eric Dumazet wrote:
On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
On 30.10.2018 at 08:29, Eric Dumazet wrote:
On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:

Indeed this is a bug. I would expect it to produce frequent errors though, as many odd-length packets would trigger it. Do you have RXFCS? Regardless, how frequently do you see the problem?

Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE. And before your fix (commit d55bef5059dd057bd), the mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted.

No RXFCS. And this trace shows up really frequently, like once every 3/4 seconds, like below:

[28965.776864] vlan1490: hw csum failure

Might be vlan related. Can you first check this:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 94224c22ecc310a87b6715051e335446f29bec03..6f4bfebf0d9a3ae7567062abb3ea6532b3aaf3d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -789,13 +789,8 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
 		skb->ip_summed = CHECKSUM_COMPLETE;
 		skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
 		if (network_depth > ETH_HLEN)
-			/* CQE csum is calculated from the IP header and does
-			 * not cover VLAN headers (if present). This will add
-			 * the checksum manually.
-			 */
-			skb->csum = csum_partial(skb->data + ETH_HLEN,
-						 network_depth - ETH_HLEN,
-						 skb->csum);
+			/* Temporary debugging */
+			skb->ip_summed = CHECKSUM_NONE;
 		if (unlikely(netdev->features & NETIF_F_RXFCS))
 			skb->csum = csum_add(skb->csum,
 					     (__force __wsum)mlx5e_get_fcs(skb));

Ok thanks - will try it.
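While that debug patch is in place, the RX checksum counters already quoted in this thread give a quick way to see whether the failures stop while CHECKSUM_COMPLETE traffic keeps flowing:

  # How the receive-checksum path is being taken, and how many failures have hit the log so far
  ethtool -S enp175s0f0 | grep -E 'rx_csum_(complete|none|unnecessary)'
  dmesg | grep -c 'hw csum failure'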
Re: Latest net-next kernel 4.19.0+
W dniu 31.10.2018 o 22:05, Saeed Mahameed pisze: On Tue, 2018-10-30 at 10:32 -0700, Cong Wang wrote: On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet wrote: On 10/30/2018 01:09 AM, Paweł Staszewski wrote: W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: Indeed this is a bug. I would expect it to produce frequent errors though as many odd-length packets would trigger it. Do you have RXFCS? Regardless, how frequently do you see the problem? Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted. No RXFCS Same with Pawel, RXFCS is disabled by default. And this trace is rly frequently like once per 3/4 seconds like below: [28965.776864] vlan1490: hw csum failure Might be vlan related. Hi Pawel, is the vlan stripping offload disabled or enabled in your case ? To verify: ethtool -k | grep rx-vlan-offload rx-vlan-offload: on To set: ethtool -K rxvlan on/off Enabled: ethtool -k enp175s0f0 Features for enp175s0f0: rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: on tx-checksum-ip-generic: off [fixed] tx-checksum-ipv6: on tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: off [fixed] scatter-gather: on tx-scatter-gather: on tx-scatter-gather-fraglist: off [fixed] tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: off [fixed] tx-tcp-mangleid-segmentation: off tx-tcp6-segmentation: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off [fixed] rx-vlan-offload: on tx-vlan-offload: on ntuple-filters: off receive-hashing: on highdma: on [fixed] rx-vlan-filter: on vlan-challenged: off [fixed] tx-lockless: off [fixed] netns-local: off [fixed] tx-gso-robust: off [fixed] tx-fcoe-segmentation: off [fixed] tx-gre-segmentation: on tx-gre-csum-segmentation: on tx-ipxip4-segmentation: off [fixed] tx-ipxip6-segmentation: off [fixed] tx-udp_tnl-segmentation: on tx-udp_tnl-csum-segmentation: on tx-gso-partial: on tx-sctp-segmentation: off [fixed] tx-esp-segmentation: off [fixed] tx-udp-segmentation: on fcoe-mtu: off [fixed] tx-nocache-copy: off loopback: off [fixed] rx-fcs: off rx-all: off tx-vlan-stag-hw-insert: on rx-vlan-stag-hw-parse: off [fixed] rx-vlan-stag-filter: on [fixed] l2-fwd-offload: off [fixed] hw-tc-offload: on esp-hw-offload: off [fixed] esp-tx-csum-hw-offload: off [fixed] rx-udp_tunnel-port-offload: on tls-hw-tx-offload: off [fixed] tls-hw-rx-offload: off [fixed] rx-gro-hw: off [fixed] tls-hw-record: off [fixed] if the vlan offload is off then it will trigger the mlx5e vlan csum adjustment code pointed out by Eric. Anyhow, it should work in both cases, but i am trying to narrow down the possibilities. Also could it be a double tagged packet ? no double tagged packets there Unlike Pawel's case, we don't use vlan at all, maybe this is why we see it much less frequently than Pawel. Also, it is probably not specific to mlx5, as there is another report which is probably a non-mlx5 driver. Cong, How often does this happen ? can you some how verify if the problematic packet has extra end padding after the ip payload ? 
It would be cool if we had a feature in the kernel to store such an SKB in memory when an issue like this occurs, and let the user dump it later (via tcpdump) and send the dump to the vendor for debugging, so we could just replay it and see what happens. Thanks.
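Until something like that exists, an ordinary capture on the affected vlan is the closest approximation: grab full frames and look offline for the ones whose TCP checksum really is wrong (interface name taken from the traces above):

  # Capture full packets on the vlan that reports "hw csum failure", then inspect offline
  tcpdump -i vlan1490 -s 0 -w csum-fail.pcap
  tcpdump -r csum-fail.pcap -vv 2>/dev/null | grep -i 'cksum.*incorrect'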
Re: Latest net-next kernel 4.19.0+
W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: Indeed this is a bug. I would expect it to produce frequent errors though as many odd-length packets would trigger it. Do you have RXFCS? Regardless, how frequently do you see the problem? Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted. No RXFCS And this trace is rly frequently like once per 3/4 seconds like below: [28965.776864] vlan1490: hw csum failure [28965.776867] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1 [28965.776868] Call Trace: [28965.776870] [28965.776876] dump_stack+0x46/0x5b [28965.776879] __skb_checksum_complete+0x9a/0xa0 [28965.776882] tcp_v4_rcv+0xef/0x960 [28965.776884] ip_local_deliver_finish+0x49/0xd0 [28965.776886] ip_local_deliver+0x5e/0xe0 [28965.776888] ? ip_sublist_rcv_finish+0x50/0x50 [28965.776889] ip_rcv+0x41/0xc0 [28965.776891] __netif_receive_skb_one_core+0x4b/0x70 [28965.776893] netif_receive_skb_internal+0x2f/0xd0 [28965.776894] napi_gro_receive+0xb7/0xe0 [28965.776897] mlx5e_handle_rx_cqe+0x7a/0xd0 [28965.776899] mlx5e_poll_rx_cq+0xc6/0x930 [28965.776900] mlx5e_napi_poll+0xab/0xc90 [28965.776904] ? kmem_cache_free_bulk+0x1e4/0x280 [28965.776905] net_rx_action+0x1f1/0x320 [28965.776909] __do_softirq+0xec/0x2b7 [28965.776912] irq_exit+0x7b/0x80 [28965.776913] do_IRQ+0x45/0xc0 [28965.776915] common_interrupt+0xf/0xf [28965.776916] [28965.776918] RIP: 0010:mwait_idle+0x5f/0x1b0 [28965.776919] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [28965.776920] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: ffd3 [28965.776921] RAX: RBX: RCX: [28965.776922] RDX: RSI: RDI: [28965.776922] RBP: R08: 00aa R09: 88046f81fbc0 [28965.776923] R10: R11: 0001006d5985 R12: 8220f780 [28965.776924] R13: 8220f780 R14: R15: [28965.776927] do_idle+0x1a3/0x1c0 [28965.776929] cpu_startup_entry+0x14/0x20 [28965.776932] start_kernel+0x488/0x4a8 [28965.776935] secondary_startup_64+0xa4/0xb0 [28965.981529] vlan1490: hw csum failure [28965.981531] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1 [28965.981532] Call Trace: [28965.981534] [28965.981539] dump_stack+0x46/0x5b [28965.981543] __skb_checksum_complete+0x9a/0xa0 [28965.981545] tcp_v4_rcv+0xef/0x960 [28965.981548] ip_local_deliver_finish+0x49/0xd0 [28965.981550] ip_local_deliver+0x5e/0xe0 [28965.981551] ? ip_sublist_rcv_finish+0x50/0x50 [28965.981552] ip_rcv+0x41/0xc0 [28965.981555] __netif_receive_skb_one_core+0x4b/0x70 [28965.981556] netif_receive_skb_internal+0x2f/0xd0 [28965.981558] napi_gro_receive+0xb7/0xe0 [28965.981560] mlx5e_handle_rx_cqe+0x7a/0xd0 [28965.981562] mlx5e_poll_rx_cq+0xc6/0x930 [28965.981563] mlx5e_napi_poll+0xab/0xc90 [28965.981567] ? 
kmem_cache_free_bulk+0x1e4/0x280 [28965.981568] net_rx_action+0x1f1/0x320 [28965.981571] __do_softirq+0xec/0x2b7 [28965.981575] irq_exit+0x7b/0x80 [28965.981576] do_IRQ+0x45/0xc0 [28965.981578] common_interrupt+0xf/0xf [28965.981579] [28965.981580] RIP: 0010:mwait_idle+0x5f/0x1b0 [28965.981582] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [28965.981583] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: ffd3 [28965.981584] RAX: RBX: RCX: [28965.981585] RDX: RSI: RDI: [28965.981586] RBP: R08: 0383 R09: 88046f81fbc0 [28965.981586] R10: R11: 0001006d59b8 R12: 8220f780 [28965.981587] R13: 8220f780 R14: R15: [28965.981591] do_idle+0x1a3/0x1c0 [28965.981592] cpu_startup_entry+0x14/0x20 [28965.981596] start_kernel+0x488/0x4a8 [28965.981600] secondary_startup_64+0xa4/0xb0 [28966.511782] vlan1490: hw csum failure [28966.511785] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1 [28966.511785] Call Trace: [28966.511787] [28966.511793] dump_stack+0x46/0x5b [28966.511797] __skb_checksum_complete+0x9a/0xa0 [28966.511799] tcp_v4_rcv+0xef/0x960 [28966.511802] ip_local_deliver_finish+0x49/0xd0 [28966.511804] ip_local_deliver+0x5e/0xe0 [28966.511806] ? ip_sublist_rcv_finish+0x50/0x50 [
Re: Latest net-next kernel 4.19.0+
W dniu 30.10.2018 o 01:11, Paweł Staszewski pisze: Sorry not complete - followed by hw csum: [ 342.190831] vlan1490: hw csum failure [ 342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 342.190836] Call Trace: [ 342.190839] [ 342.190849] dump_stack+0x46/0x5b [ 342.190856] __skb_checksum_complete+0x9a/0xa0 [ 342.190859] tcp_v4_rcv+0xef/0x960 [ 342.190864] ip_local_deliver_finish+0x49/0xd0 [ 342.190866] ip_local_deliver+0x5e/0xe0 [ 342.190869] ? ip_sublist_rcv_finish+0x50/0x50 [ 342.190870] ip_rcv+0x41/0xc0 [ 342.190874] __netif_receive_skb_one_core+0x4b/0x70 [ 342.190877] netif_receive_skb_internal+0x2f/0xd0 [ 342.190879] napi_gro_receive+0xb7/0xe0 [ 342.190884] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 342.190886] mlx5e_poll_rx_cq+0xc6/0x930 [ 342.190888] mlx5e_napi_poll+0xab/0xc90 [ 342.190893] ? kmem_cache_free_bulk+0x1e4/0x280 [ 342.190895] net_rx_action+0x1f1/0x320 [ 342.190901] __do_softirq+0xec/0x2b7 [ 342.190908] irq_exit+0x7b/0x80 [ 342.190910] do_IRQ+0x45/0xc0 [ 342.190912] common_interrupt+0xf/0xf [ 342.190914] [ 342.190916] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 342.190917] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 342.190918] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffdd [ 342.190920] RAX: RBX: 0034 RCX: [ 342.190921] RDX: RSI: RDI: [ 342.190922] RBP: 0034 R08: 0057 R09: 88086fa1fbc0 [ 342.190923] R10: R11: 000128cc R12: 88086d18 [ 342.190923] R13: 88086d18 R14: R15: [ 342.190929] do_idle+0x1a3/0x1c0 [ 342.190931] cpu_startup_entry+0x14/0x20 [ 342.190934] start_secondary+0x165/0x190 [ 342.190939] secondary_startup_64+0xa4/0xb0 W dniu 30.10.2018 o 01:10, Paweł Staszewski pisze: Hi Just checked in test lab latest kernel and have weird traces: [ 219.888673] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 219.888674] Call Trace: [ 219.888676] [ 219.888685] dump_stack+0x46/0x5b [ 219.888691] __skb_checksum_complete+0x9a/0xa0 [ 219.888694] tcp_v4_rcv+0xef/0x960 [ 219.888698] ip_local_deliver_finish+0x49/0xd0 [ 219.888700] ip_local_deliver+0x5e/0xe0 [ 219.888702] ? ip_sublist_rcv_finish+0x50/0x50 [ 219.888703] ip_rcv+0x41/0xc0 [ 219.888706] __netif_receive_skb_one_core+0x4b/0x70 [ 219.888708] netif_receive_skb_internal+0x2f/0xd0 [ 219.888710] napi_gro_receive+0xb7/0xe0 [ 219.888714] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 219.888716] mlx5e_poll_rx_cq+0xc6/0x930 [ 219.888717] mlx5e_napi_poll+0xab/0xc90 [ 219.888722] ? enqueue_task_fair+0x286/0xc40 [ 219.888723] ? 
enqueue_task_fair+0x1d6/0xc40 [ 219.888725] net_rx_action+0x1f1/0x320 [ 219.888730] __do_softirq+0xec/0x2b7 [ 219.888736] irq_exit+0x7b/0x80 [ 219.888737] do_IRQ+0x45/0xc0 [ 219.888740] common_interrupt+0xf/0xf [ 219.888742] [ 219.888743] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 219.888745] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 219.888746] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffde [ 219.888749] RAX: RBX: 0034 RCX: [ 219.888749] RDX: RSI: RDI: [ 219.888750] RBP: 0034 R08: 003b R09: 88086fa1fbc0 [ 219.888751] R10: R11: b15d R12: 88086d18 [ 219.888752] R13: 88086d18 R14: R15: [ 219.888754] do_idle+0x1a3/0x1c0 [ 219.888757] cpu_startup_entry+0x14/0x20 [ 219.888760] start_secondary+0x165/0x190 Also some perf top attacked to this - 14G rx traffic on vlans (pktgen generated random destination ip's and forwarded by test server) PerfTop: 45296 irqs/sec kernel:99.3% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 7.43% [kernel] [k] mlx5e_skb_from_cqe_linear 5.17% [kernel] [k] mlx5e_sq_xmit 3.83% [kernel] [k] fib_table_lookup 3.41% [kernel] [k] irq_entries_start 2.91% [kernel] [k] build_skb 2.50% [kernel] [k] mlx5_eq_int 2.29% [kernel] [k] _raw_spin_lock 2.27% [kernel] [k] tasklet_action_common.isra.21 1.99% [kernel] [k] _raw_spin_lock_irqsave 1.91% [kernel] [k] memcpy_erms
Re: Latest net-next kernel 4.19.0+
Sorry not complete - followed by hw csum: [ 342.190831] vlan1490: hw csum failure [ 342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 342.190836] Call Trace: [ 342.190839] [ 342.190849] dump_stack+0x46/0x5b [ 342.190856] __skb_checksum_complete+0x9a/0xa0 [ 342.190859] tcp_v4_rcv+0xef/0x960 [ 342.190864] ip_local_deliver_finish+0x49/0xd0 [ 342.190866] ip_local_deliver+0x5e/0xe0 [ 342.190869] ? ip_sublist_rcv_finish+0x50/0x50 [ 342.190870] ip_rcv+0x41/0xc0 [ 342.190874] __netif_receive_skb_one_core+0x4b/0x70 [ 342.190877] netif_receive_skb_internal+0x2f/0xd0 [ 342.190879] napi_gro_receive+0xb7/0xe0 [ 342.190884] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 342.190886] mlx5e_poll_rx_cq+0xc6/0x930 [ 342.190888] mlx5e_napi_poll+0xab/0xc90 [ 342.190893] ? kmem_cache_free_bulk+0x1e4/0x280 [ 342.190895] net_rx_action+0x1f1/0x320 [ 342.190901] __do_softirq+0xec/0x2b7 [ 342.190908] irq_exit+0x7b/0x80 [ 342.190910] do_IRQ+0x45/0xc0 [ 342.190912] common_interrupt+0xf/0xf [ 342.190914] [ 342.190916] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 342.190917] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 342.190918] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffdd [ 342.190920] RAX: RBX: 0034 RCX: [ 342.190921] RDX: RSI: RDI: [ 342.190922] RBP: 0034 R08: 0057 R09: 88086fa1fbc0 [ 342.190923] R10: R11: 000128cc R12: 88086d18 [ 342.190923] R13: 88086d18 R14: R15: [ 342.190929] do_idle+0x1a3/0x1c0 [ 342.190931] cpu_startup_entry+0x14/0x20 [ 342.190934] start_secondary+0x165/0x190 [ 342.190939] secondary_startup_64+0xa4/0xb0 W dniu 30.10.2018 o 01:10, Paweł Staszewski pisze: Hi Just checked in test lab latest kernel and have weird traces: [ 219.888673] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 219.888674] Call Trace: [ 219.888676] [ 219.888685] dump_stack+0x46/0x5b [ 219.888691] __skb_checksum_complete+0x9a/0xa0 [ 219.888694] tcp_v4_rcv+0xef/0x960 [ 219.888698] ip_local_deliver_finish+0x49/0xd0 [ 219.888700] ip_local_deliver+0x5e/0xe0 [ 219.888702] ? ip_sublist_rcv_finish+0x50/0x50 [ 219.888703] ip_rcv+0x41/0xc0 [ 219.888706] __netif_receive_skb_one_core+0x4b/0x70 [ 219.888708] netif_receive_skb_internal+0x2f/0xd0 [ 219.888710] napi_gro_receive+0xb7/0xe0 [ 219.888714] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 219.888716] mlx5e_poll_rx_cq+0xc6/0x930 [ 219.888717] mlx5e_napi_poll+0xab/0xc90 [ 219.888722] ? enqueue_task_fair+0x286/0xc40 [ 219.888723] ? enqueue_task_fair+0x1d6/0xc40 [ 219.888725] net_rx_action+0x1f1/0x320 [ 219.888730] __do_softirq+0xec/0x2b7 [ 219.888736] irq_exit+0x7b/0x80 [ 219.888737] do_IRQ+0x45/0xc0 [ 219.888740] common_interrupt+0xf/0xf [ 219.888742] [ 219.888743] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 219.888745] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 219.888746] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffde [ 219.888749] RAX: RBX: 0034 RCX: [ 219.888749] RDX: RSI: RDI: [ 219.888750] RBP: 0034 R08: 003b R09: 88086fa1fbc0 [ 219.888751] R10: R11: b15d R12: 88086d18 [ 219.888752] R13: 88086d18 R14: R15: [ 219.888754] do_idle+0x1a3/0x1c0 [ 219.888757] cpu_startup_entry+0x14/0x20 [ 219.888760] start_secondary+0x165/0x190
Latest net-next kernel 4.19.0+
Hi Just checked in test lab latest kernel and have weird traces: [ 219.888673] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 219.888674] Call Trace: [ 219.888676] [ 219.888685] dump_stack+0x46/0x5b [ 219.888691] __skb_checksum_complete+0x9a/0xa0 [ 219.888694] tcp_v4_rcv+0xef/0x960 [ 219.888698] ip_local_deliver_finish+0x49/0xd0 [ 219.888700] ip_local_deliver+0x5e/0xe0 [ 219.888702] ? ip_sublist_rcv_finish+0x50/0x50 [ 219.888703] ip_rcv+0x41/0xc0 [ 219.888706] __netif_receive_skb_one_core+0x4b/0x70 [ 219.888708] netif_receive_skb_internal+0x2f/0xd0 [ 219.888710] napi_gro_receive+0xb7/0xe0 [ 219.888714] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 219.888716] mlx5e_poll_rx_cq+0xc6/0x930 [ 219.888717] mlx5e_napi_poll+0xab/0xc90 [ 219.888722] ? enqueue_task_fair+0x286/0xc40 [ 219.888723] ? enqueue_task_fair+0x1d6/0xc40 [ 219.888725] net_rx_action+0x1f1/0x320 [ 219.888730] __do_softirq+0xec/0x2b7 [ 219.888736] irq_exit+0x7b/0x80 [ 219.888737] do_IRQ+0x45/0xc0 [ 219.888740] common_interrupt+0xf/0xf [ 219.888742] [ 219.888743] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 219.888745] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 219.888746] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffde [ 219.888749] RAX: RBX: 0034 RCX: [ 219.888749] RDX: RSI: RDI: [ 219.888750] RBP: 0034 R08: 003b R09: 88086fa1fbc0 [ 219.888751] R10: R11: b15d R12: 88086d18 [ 219.888752] R13: 88086d18 R14: R15: [ 219.888754] do_idle+0x1a3/0x1c0 [ 219.888757] cpu_startup_entry+0x14/0x20 [ 219.888760] start_secondary+0x165/0x190
Re: after adding > 200vlans to mlx nic no traffic
W dniu 31.01.2018 o 13:19, Gal Pressman pisze: On 30-Jan-18 17:57, Paweł Staszewski wrote: W dniu 30.01.2018 o 15:57, Gal Pressman pisze: On 30-Jan-18 02:29, Paweł Staszewski wrote: Weird thing with mellanox mlx5 (connectx-4) kernel 4.15-rc9 - from net-next davem tree after: ip link add link enp175s0f1 name vlan1538 type vlan id 1538 ip link set up dev vlan1538 traffic on vlan is working But after VID="1160 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 150 0 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1534 1535 1394 1393 1550 1500 1526 1536 1537 1538 1539 1540 1542 1541 1543 1544 1801 1546 1547 1548 1 549 1735 3132 3143 3104 3125 3103 3115 3134 3105 3113 3141 4009 3144 3130 1803 3146 3148 3109 1551 1552 1553 1554 1555 1556 1558 1559 1560 1561 1562 1563 1564 1565 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1591 1592 1593 1594 1595 1596 1597 1598 1599 1557 1545 2001 250 4043 1806 1600 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1625 1626 1627 1628 1629 1630 1631 1632 1634 1635 1636 1640 1641 164 2 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1601 1666 1667 1668 1669 1670 1671 1672 1673 1674 1676 1677 1678 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1696 1 697 1698 1712 1817 1869 1810 1814 1818 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1885 1890 1891 1892 1893 1894 1895 1898 1881 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2290" for i in $VID do ip link add link enp175s0f1 name vlan$i type vlan id $i done And setting vlan 1538 up - there is no received traffic on this vlan. So searching for broken things (last time same problem was with ixgbe) ethtool -K enp175s0f1 rx-vlan-filter off And all vlans attached to this device start working Hi Pawel, I tried to reproduce the issue in our local setups without success. Can you please provide more information? are there any errors in dmesg? did you configure anything else that might be relevant to this issue? Do you know if this is a new degradation to 4.15-rc9? previous kernel used was 4.13.2 - without this problem. current kernel is net-next 4.15.0-rc9+ https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git Try to send traffic over the vlans and sample the ethtool counters (ethtool -S enp175s0f1) of the receiver mlx5 interface over time, this might help us trace where the packets drop. Yes traffic is going out from interface - bot there is nothing on RX - tcpdump shows no packets arriving to interface I am running 4.15.0-rc9+ from Dave's tree, currently on commit 91e6dd828425 ("ipmr: Fix ptrdiff_t print formatting"). Tested with the commands you provided and same configuration, the issue does not reproduce on our setups. Did you see any errors in dmesg? anything coming from mlx5 driver? No errors in dmesg Which firmware version are you using? Please provide your .config file, perhaps it is making the difference. 
Ok maybee I will add also ethtool configuration that is started before ip link vlan is added: ifc='enp175s0f0 enp175s0f1' for i in $ifc do ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 4096 tx 4096 ip link set $i txqueuelen 1000 ethtool -L $i combined 28 ethtool -N $i rx-flow-hash udp4 sdfn ethtool -C $i adaptive-rx off rx-usecs 256 rx-frames 128 done There are two interfaces enp175s0f0 enp175s0f1 First one have also some vlans: Below full list: cat /proc/net/vlan/config VLAN Dev name | VLAN ID Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD vlan1538 | 1538 | enp175s0f1 vlan1160 | 1160 | enp175s0f1 vlan1450 | 1450 | enp175s0f1 vlan1451 | 1451 | enp175s0f1 vlan1452 | 1452 | enp175s0f1 vlan1453 | 1453 | enp175s0f1 vlan1454 | 1454 | enp175s0f1 vlan1455 | 1455 | enp175s0f1 vlan1456 | 1456 | enp175s0f1 vlan1457 | 1457 | enp175s0f1 vlan1458 | 1458 | enp175s0f1 vlan1459 | 1459 | enp175s0f1 vlan1460 | 1460 | enp175s0f1 vlan1461 | 1461 | enp175s0f1 vlan1462 | 1462 | enp175s0f1 vlan1463 | 1463 | enp175s0f1 vlan1464 | 1464 | enp175s0f1 vlan1465 | 1465 | enp175s0f1 vlan1466 | 1466 | enp175s0f1 vlan1467 | 1467 | enp175s0f1
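To answer the firmware question, and to double-check how many vlan devices the port is actually carrying, the following is enough (the /proc path is the standard 8021q one already shown above):

  # Driver and firmware version of the mlx5 port
  ethtool -i enp175s0f1
  # Number of vlan interfaces currently stacked on that port
  grep -c 'enp175s0f1$' /proc/net/vlan/config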
Re: after adding > 200vlans to mlx nic no traffic
W dniu 30.01.2018 o 15:57, Gal Pressman pisze: On 30-Jan-18 02:29, Paweł Staszewski wrote: Weird thing with mellanox mlx5 (connectx-4) kernel 4.15-rc9 - from net-next davem tree after: ip link add link enp175s0f1 name vlan1538 type vlan id 1538 ip link set up dev vlan1538 traffic on vlan is working But after VID="1160 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 150 0 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1534 1535 1394 1393 1550 1500 1526 1536 1537 1538 1539 1540 1542 1541 1543 1544 1801 1546 1547 1548 1 549 1735 3132 3143 3104 3125 3103 3115 3134 3105 3113 3141 4009 3144 3130 1803 3146 3148 3109 1551 1552 1553 1554 1555 1556 1558 1559 1560 1561 1562 1563 1564 1565 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1591 1592 1593 1594 1595 1596 1597 1598 1599 1557 1545 2001 250 4043 1806 1600 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1625 1626 1627 1628 1629 1630 1631 1632 1634 1635 1636 1640 1641 164 2 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1601 1666 1667 1668 1669 1670 1671 1672 1673 1674 1676 1677 1678 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1696 1 697 1698 1712 1817 1869 1810 1814 1818 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1885 1890 1891 1892 1893 1894 1895 1898 1881 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2290" for i in $VID do ip link add link enp175s0f1 name vlan$i type vlan id $i done And setting vlan 1538 up - there is no received traffic on this vlan. So searching for broken things (last time same problem was with ixgbe) ethtool -K enp175s0f1 rx-vlan-filter off And all vlans attached to this device start working Hi Pawel, I tried to reproduce the issue in our local setups without success. Can you please provide more information? are there any errors in dmesg? did you configure anything else that might be relevant to this issue? Do you know if this is a new degradation to 4.15-rc9? previous kernel used was 4.13.2 - without this problem. current kernel is net-next 4.15.0-rc9+ https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git Try to send traffic over the vlans and sample the ethtool counters (ethtool -S enp175s0f1) of the receiver mlx5 interface over time, this might help us trace where the packets drop. 
Yes traffic is going out from interface - bot there is nothing on RX - tcpdump shows no packets arriving to interface Thank you for reporting this, Gal Interface settings: (working case with rx vlan filter turned off) ethtool -k enp175s0f1 Features for enp175s0f1: rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: on tx-checksum-ip-generic: off [fixed] tx-checksum-ipv6: on tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: off [fixed] scatter-gather: on tx-scatter-gather: on tx-scatter-gather-fraglist: off [fixed] tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: off [fixed] tx-tcp-mangleid-segmentation: off tx-tcp6-segmentation: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off rx-vlan-offload: on tx-vlan-offload: on ntuple-filters: off receive-hashing: on highdma: on [fixed] rx-vlan-filter: off vlan-challenged: off [fixed] tx-lockless: off [fixed] netns-local: off [fixed] tx-gso-robust: off [fixed] tx-fcoe-segmentation: off [fixed] tx-gre-segmentation: on tx-gre-csum-segmentation: on tx-ipxip4-segmentation: off [fixed] tx-ipxip6-segmentation: off [fixed] tx-udp_tnl-segmentation: on tx-udp_tnl-csum-segmentation: on tx-gso-partial: on tx-sctp-segmentation: off [fixed] tx-esp-segmentation: off [fixed] fcoe-mtu: off [fixed] tx-nocache-copy: off loopback: off [fixed] rx-fcs: off rx-all: off tx-vlan-stag-hw-insert: on rx-vlan-stag-hw-parse: off [fixed] rx-vlan-stag-filter: on [fixed] l2-fwd-offload: off [fixed] hw-tc-offload: off esp-hw-offload: off [fixed] esp-tx-csum-hw-offload: off [fixed] rx-udp_tunnel-port-offload: on rx-gro-hw: off [fixed] Coalesce parameters for enp175s0f1: Adaptive RX: off TX: off stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 dmac: 32571 rx-usecs: 256 rx-frames: 128 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 16 tx-frames: 32 tx-usecs-irq: 0 tx-frames-irq: 0 rx
after adding > 200vlans to mlx nic no traffic
Weird thing with mellanox mlx5 (connectx-4) kernel 4.15-rc9 - from net-next davem tree after: ip link add link enp175s0f1 name vlan1538 type vlan id 1538 ip link set up dev vlan1538 traffic on vlan is working But after VID="1160 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 150 0 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1534 1535 1394 1393 1550 1500 1526 1536 1537 1538 1539 1540 1542 1541 1543 1544 1801 1546 1547 1548 1 549 1735 3132 3143 3104 3125 3103 3115 3134 3105 3113 3141 4009 3144 3130 1803 3146 3148 3109 1551 1552 1553 1554 1555 1556 1558 1559 1560 1561 1562 1563 1564 1565 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1591 1592 1593 1594 1595 1596 1597 1598 1599 1557 1545 2001 250 4043 1806 1600 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1625 1626 1627 1628 1629 1630 1631 1632 1634 1635 1636 1640 1641 164 2 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1601 1666 1667 1668 1669 1670 1671 1672 1673 1674 1676 1677 1678 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1696 1 697 1698 1712 1817 1869 1810 1814 1818 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1885 1890 1891 1892 1893 1894 1895 1898 1881 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2290" for i in $VID do ip link add link enp175s0f1 name vlan$i type vlan id $i done And setting vlan 1538 up - there is no received traffic on this vlan. So searching for broken things (last time same problem was with ixgbe) ethtool -K enp175s0f1 rx-vlan-filter off And all vlans attached to this device start working
xdp_router_ipv4 mellanox problem
Hi Want to do some tests with xdp_router on two 100G physical interfaces but: Jan 29 17:00:40 HOST kernel: mlx5_core :af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0) Jan 29 17:00:40 HOST kernel: mlx5_core :af:00.0 enp175s0f0: Link up Jan 29 17:00:41 HOST kernel: mlx5_core :af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0) Jan 29 17:00:41 HOST kernel: mlx5_core :af:00.1 enp175s0f1: Link up Jan 29 17:00:41 HOST kernel: [ cut here ] Jan 29 17:00:41 HOST kernel: Driver unsupported XDP return value 4, expect packet loss! Jan 29 17:00:41 HOST kernel: WARNING: CPU: 43 PID: 0 at net/core/filter.c:3901 bpf_warn_invalid_xdp_action+0x34/0x40 Jan 29 17:00:41 HOST kernel: Modules linked in: x86_pkg_temp_thermal ipmi_si Jan 29 17:00:41 HOST kernel: CPU: 43 PID: 0 Comm: swapper/43 Not tainted 4.15.0-rc9+ #1 Jan 29 17:00:41 HOST kernel: RIP: 0010:bpf_warn_invalid_xdp_action+0x34/0x40 Jan 29 17:00:41 HOST kernel: RSP: 0018:88087f9c3dc8 EFLAGS: 00010296 Jan 29 17:00:41 HOST kernel: RAX: 003a RBX: 88081ea38000 RCX: 0006 Jan 29 17:00:41 HOST kernel: RDX: 0007 RSI: 0092 RDI: 88087f9d53d0 Jan 29 17:00:41 HOST kernel: RBP: 88087f9c3e58 R08: 0001 R09: 0536 Jan 29 17:00:41 HOST kernel: R10: 0004 R11: 0536 R12: 8808304d3000 Jan 29 17:00:41 HOST kernel: R13: 02c0 R14: 88081e53c000 R15: c907d000 Jan 29 17:00:41 HOST kernel: FS: () GS:88087f9c() knlGS: Jan 29 17:00:41 HOST kernel: CS: 0010 DS: ES: CR0: 80050033 Jan 29 17:00:41 HOST kernel: CR2: 02038648 CR3: 0220a002 CR4: 007606e0 Jan 29 17:00:41 HOST kernel: DR0: DR1: DR2: Jan 29 17:00:41 HOST kernel: DR3: DR6: fffe0ff0 DR7: 0400 Jan 29 17:00:41 HOST kernel: PKRU: 5554 Jan 29 17:00:41 HOST kernel: Call Trace: Jan 29 17:00:41 HOST kernel: Jan 29 17:00:41 HOST kernel: mlx5e_handle_rx_cqe+0x279/0x900 Jan 29 17:00:41 HOST kernel: mlx5e_poll_rx_cq+0xb3/0x860 Jan 29 17:00:41 HOST kernel: mlx5e_napi_poll+0x81/0x6f0 Jan 29 17:00:41 HOST kernel: ? mlx5_cq_completion+0x4d/0xb0 Jan 29 17:00:41 HOST kernel: net_rx_action+0x1cd/0x2f0 Jan 29 17:00:41 HOST kernel: __do_softirq+0xe4/0x275 Jan 29 17:00:41 HOST kernel: irq_exit+0x6b/0x70 Jan 29 17:00:41 HOST kernel: do_IRQ+0x45/0xc0 Jan 29 17:00:41 HOST kernel: common_interrupt+0x95/0x95 Jan 29 17:00:41 HOST kernel: Jan 29 17:00:41 HOST kernel: RIP: 0010:mwait_idle+0x59/0x160 Jan 29 17:00:41 HOST kernel: RSP: 0018:c90003497ef8 EFLAGS: 0246 ORIG_RAX: ffdd Jan 29 17:00:41 HOST kernel: RAX: RBX: 002b RCX: Jan 29 17:00:41 HOST kernel: RDX: RSI: RDI: Jan 29 17:00:41 HOST kernel: RBP: 002b R08: 1000 R09: Jan 29 17:00:41 HOST kernel: R10: R11: 000100130e40 R12: 88086d165000 Jan 29 17:00:41 HOST kernel: R13: 88086d165000 R14: R15: Jan 29 17:00:41 HOST kernel: do_idle+0x14e/0x160 Jan 29 17:00:41 HOST kernel: cpu_startup_entry+0x14/0x20 Jan 29 17:00:41 HOST kernel: secondary_startup_64+0xa5/0xb0 Jan 29 17:00:41 HOST kernel: Code: c3 83 ff 04 48 c7 c0 1a cf 10 82 89 fa c6 05 9a df b4 00 01 48 c7 c6 22 cf 10 82 48 c7 c7 38 cf 10 82 48 0f 47 f0 e8 ec 19 8b ff <0f> ff c3 66 0f 1f 84 00 00 00 00 00 81 fe ff ff 00 00 55 48 89 Jan 29 17:00:41 HOST kernel: ---[ end trace 2b255fac8d0824de ]--- I can attach xdp_router_ipv4 to any vlan interface without crash ./xdp_router_ipv4 vlan4032 **loading bpf file* Attached to 8 ***ROUTE TABLE* NEW Route entry Destination Gateway Genmask Metric Iface 192.168.32.0 0 24 0 vlan4032 ***ARP TABLE*** Address HwAddress 7920a8c0 8da6fb902500 120a8c0 44fc9e0c5e4c But after attaching to physical interface there is "above trace". Thanks Paweł
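For context, return value 4 in that warning is XDP_REDIRECT, which is what the sample returns when it forwards a packet, so the warning simply means this driver/kernel combination does not yet handle redirect on the physical mlx5 interface. The value can be confirmed from the UAPI header:

  # XDP_ABORTED=0, XDP_DROP=1, XDP_PASS=2, XDP_TX=3, XDP_REDIRECT=4
  grep -n -A6 'enum xdp_action' include/uapi/linux/bpf.h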
Re: kernel 4.15.0-rc9+ (net-next) high cpu load at 50Gbit/s - about 6Mpps
W dniu 27.01.2018 o 23:23, Paweł Staszewski pisze: Hi Today I made some real life traffic tests with kernel 4.15.0-rc9 but when traffic reach 50Gbit/s and about 6Mpps cpou load rises fast from 48% to 100% for all cpu cores. Here is some graph that presenting how cpu load rises when there was more pps. https://ibb.co/mhD5ob here is perf record from that time: https://pastebin.com/3zqG1rvE There is 8x 10G ixgbe 82599 interfaces teamed with teamd. No traffic queueing - only pfifo fast on all interfaces. No NAT or iptables forles other than INPUT (about 30rules) All nic's have same ethtool settings: ethtool -k eth0 Features for eth0: Cannot get device udp-fragmentation-offload settings: Operation not supported rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: off [fixed] tx-checksum-ip-generic: on tx-checksum-ipv6: off [fixed] tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: on scatter-gather: on tx-scatter-gather: on tx-scatter-gather-fraglist: off [fixed] tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: off [fixed] tx-tcp-mangleid-segmentation: off tx-tcp6-segmentation: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off rx-vlan-offload: on tx-vlan-offload: on ntuple-filters: on receive-hashing: on highdma: on [fixed] rx-vlan-filter: on vlan-challenged: off [fixed] tx-lockless: off [fixed] netns-local: off [fixed] tx-gso-robust: off [fixed] tx-fcoe-segmentation: off [fixed] tx-gre-segmentation: on tx-gre-csum-segmentation: on tx-ipxip4-segmentation: on tx-ipxip6-segmentation: on tx-udp_tnl-segmentation: on tx-udp_tnl-csum-segmentation: on tx-gso-partial: on tx-sctp-segmentation: off [fixed] tx-esp-segmentation: off [fixed] fcoe-mtu: off [fixed] tx-nocache-copy: off loopback: off [fixed] rx-fcs: off [fixed] rx-all: off tx-vlan-stag-hw-insert: off [fixed] rx-vlan-stag-hw-parse: off [fixed] rx-vlan-stag-filter: off [fixed] l2-fwd-offload: off hw-tc-offload: off esp-hw-offload: off [fixed] esp-tx-csum-hw-offload: off [fixed] rx-udp_tunnel-port-offload: on ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096 Current hardware settings: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 2048 ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: off TX: off stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 rx-usecs: 512 rx-frames: 0 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 0 tx-frames: 0 tx-usecs-irq: 0 tx-frames-irq: 0 rx-usecs-low: 0 rx-frame-low: 0 tx-usecs-low: 0 tx-frame-low: 0 rx-usecs-high: 0 rx-frame-high: 0 tx-usecs-high: 0 tx-frame-high: 0 Peft top for kernel 4.15.0-rc9 below (all 40 cores 100% cpu load with 6.3Mpps) 20.96% [kernel] [k] queued_spin_lock_slowpath 5.51% [kernel] [k] ixgbe_poll 5.49% [kernel] [k] ixgbe_xmit_frame_ring 4.39% [kernel] [k] do_raw_spin_lock 4.29% [kernel] [k] sch_direct_xmit 4.11% [kernel] [k] fib_table_lookup 3.11% [team_mode_roundrobin] [k] rr_transmit 2.71% [kernel] [k] __dev_queue_xmit 2.62% [kernel] [k] __ptr_ring_peek 2.39% [kernel] [k] skb_release_data 2.18% [kernel] [k] dev_gro_receive 1.75% [kernel] [k] __qdisc_run 1.67% [kernel] [k] pfifo_fast_enqueue 1.57% [kernel] [k] netdev_pick_tx 1.56% [kernel] [k] page_frag_free 1.48% [kernel] [k] ip_finish_output2 1.38% [kernel] [k] __slab_free 1.36% [kernel] [k] skb_unref 1.34% [kernel] [k] ixgbe_maybe_stop_tx 1.30% [kernel] [k] vlan_do_receive 1.28% [kernel] [k] pfifo_fast_dequeue 1.23% [kernel] [k] virt_to_head_page Same configuration 
kernel 4.15.0-rc3 (50% cpu load on all 40 cores with 6.3Mpps) 7.81% [kernel] [k] ixgbe_xmit_frame_ring 7.61% [kernel] [k] ixgbe_poll 7.09% [kernel] [k] do_raw_spin_lock 5.63% [kernel] [k] fib_table_lookup 5.19% [kernel] [k] __dev_queue_xmit 4.38% [team_mode_roundrobin] [k] rr_transmit 3.10% [kernel] [k] netdev_pick_tx 2.79% [kernel] [k] skb_release_data 2.34% [kernel] [k] dev_gro_receive 1.99% [kernel] [k] page_frag_free 1.96% [kernel] [k] skb_unref 1.92% [kernel] [k] virt_to_head_page 1.90% [kernel] [k] ixgbe_maybe_st
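In the 4.15.0-rc9 profile above, queued_spin_lock_slowpath together with pfifo_fast_enqueue/dequeue and sch_direct_xmit points at contention on the root qdisc locks while many CPUs transmit through the team slaves. A hedged sketch of one common mitigation, not something tested in this thread: replace the single pfifo_fast root with mq, which attaches one child qdisc per hardware TX queue so transmit CPUs are not serialized on one lock per device. The interface names below are placeholders:

# Sketch only: give each hardware TX queue its own qdisc instead of one
# shared root pfifo_fast per device.
for dev in eth0 eth1 eth2 eth3 eth4 eth5 eth6 eth7; do    # placeholder names
    tc qdisc replace dev "$dev" root handle 1: mq
    tc qdisc show dev "$dev"
done

Whether this helps depends on whether the slaves currently funnel all TX queues through a single root qdisc; it is only worth trying where the perf profile shows the qdisc spinlock dominating, as it does here.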
kernel 4.15.0-rc9+ (net-next) high cpu load at 50Gbit/s - about 6Mpps
Hi Today I made some real life traffic tests with kernel 4.15.0-rc9 but when traffic reach 50Gbit/s and about 6Mpps cpou load rises fast from 48% to 100% for all cpu cores. Here is some graph that presenting how cpu load rises when there was more pps. https://ibb.co/mhD5ob here is perf record from that time: https://pastebin.com/3zqG1rvE There is 8x 10G ixgbe 82599 interfaces teamed with teamd. No traffic queueing - only pfifo fast on all interfaces. No NAT or iptables forles other than INPUT (about 30rules) All nic's have same ethtool settings: ethtool -k eth0 Features for eth0: Cannot get device udp-fragmentation-offload settings: Operation not supported rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: off [fixed] tx-checksum-ip-generic: on tx-checksum-ipv6: off [fixed] tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: on scatter-gather: on tx-scatter-gather: on tx-scatter-gather-fraglist: off [fixed] tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: off [fixed] tx-tcp-mangleid-segmentation: off tx-tcp6-segmentation: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off rx-vlan-offload: on tx-vlan-offload: on ntuple-filters: on receive-hashing: on highdma: on [fixed] rx-vlan-filter: on vlan-challenged: off [fixed] tx-lockless: off [fixed] netns-local: off [fixed] tx-gso-robust: off [fixed] tx-fcoe-segmentation: off [fixed] tx-gre-segmentation: on tx-gre-csum-segmentation: on tx-ipxip4-segmentation: on tx-ipxip6-segmentation: on tx-udp_tnl-segmentation: on tx-udp_tnl-csum-segmentation: on tx-gso-partial: on tx-sctp-segmentation: off [fixed] tx-esp-segmentation: off [fixed] fcoe-mtu: off [fixed] tx-nocache-copy: off loopback: off [fixed] rx-fcs: off [fixed] rx-all: off tx-vlan-stag-hw-insert: off [fixed] rx-vlan-stag-hw-parse: off [fixed] rx-vlan-stag-filter: off [fixed] l2-fwd-offload: off hw-tc-offload: off esp-hw-offload: off [fixed] esp-tx-csum-hw-offload: off [fixed] rx-udp_tunnel-port-offload: on ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096 Current hardware settings: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 2048 ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: off TX: off stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 rx-usecs: 512 rx-frames: 0 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 0 tx-frames: 0 tx-usecs-irq: 0 tx-frames-irq: 0 rx-usecs-low: 0 rx-frame-low: 0 tx-usecs-low: 0 tx-frame-low: 0 rx-usecs-high: 0 rx-frame-high: 0 tx-usecs-high: 0 tx-frame-high: 0
Re: Huge memory leak with 4.15.0-rc2+
W dniu 2017-12-11 o 23:27, Paweł Staszewski pisze: W dniu 2017-12-11 o 23:15, John Fastabend pisze: On 12/11/2017 01:48 PM, Paweł Staszewski wrote: W dniu 2017-12-11 o 22:23, Paweł Staszewski pisze: Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. [...] Some observations - when i disable tso on all cards there is more memleak. When traffic starts to drop - there is less and less memleak below link to memory usage graph: https://ibb.co/hU97kG And there is rising slab_unrecl - Amount of unreclaimable memory used for slab kernel allocations Forgot to add that im using hfsc and qdiscs like pfifo on classes. Maybe some error case I missed in the qdisc patches I'm looking into it. Thanks, John This is how it looks like when corelated on graph - traffic vs mem https://ibb.co/njpkqG Typical hfsc class + qdisc: ### Client interface vlan1616 tc qdisc del dev vlan1616 root tc qdisc add dev vlan1616 handle 1: root hfsc default 100 tc class add dev vlan1616 parent 1: classid 1:100 hfsc ls m2 200Mbit ul m2 200Mbit tc qdisc add dev vlan1616 parent 1:100 handle 100: pfifo limit 128 ### End TM for client interface tc qdisc del dev vlan1616 ingress tc qdisc add dev vlan1616 handle : ingress tc filter add dev vlan1616 parent : protocol ip prio 50 u32 match ip src 0.0.0.0/0 police rate 200Mbit burst 200M mtu 32k drop flowid 1:1 And this is same for about 450 vlan interfaces Good thing is that compared to 4.14.3 i have about 5% less cpu load on 4.15.0-rc2+ When hfsc will be lockless or tbf - then it will be really huge difference in cpu load on x86 when using traffic shaping - so really good job John. Yestarday changed kernel from https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=v4.15-rc3 And there is no memleak. So yes probabbly lockless qdisc patches
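Since the leak in this thread shows up as growing unreclaimable slab, a small monitoring sketch to correlate traffic with slab growth; it only records what /proc/meminfo and slabtop already expose, and the 10-second interval is arbitrary:

# Sketch only: log unreclaimable slab growth every 10 seconds.
while true; do
    ts=$(date +%s)
    sunreclaim=$(awk '/^SUnreclaim:/ {print $2}' /proc/meminfo)   # value in kB
    echo "$ts SUnreclaim_kB=$sunreclaim"
    sleep 10
done

# Top slab caches by size (skb and page-frag caches are the usual suspects
# for driver RX leaks):
slabtop -o -s c | head -20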
Re: Huge memory leak with 4.15.0-rc2+
W dniu 2017-12-11 o 23:15, John Fastabend pisze: On 12/11/2017 01:48 PM, Paweł Staszewski wrote: W dniu 2017-12-11 o 22:23, Paweł Staszewski pisze: Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. [...] Some observations - when i disable tso on all cards there is more memleak. When traffic starts to drop - there is less and less memleak below link to memory usage graph: https://ibb.co/hU97kG And there is rising slab_unrecl - Amount of unreclaimable memory used for slab kernel allocations Forgot to add that im using hfsc and qdiscs like pfifo on classes. Maybe some error case I missed in the qdisc patches I'm looking into it. Thanks, John This is how it looks like when corelated on graph - traffic vs mem https://ibb.co/njpkqG Typical hfsc class + qdisc: ### Client interface vlan1616 tc qdisc del dev vlan1616 root tc qdisc add dev vlan1616 handle 1: root hfsc default 100 tc class add dev vlan1616 parent 1: classid 1:100 hfsc ls m2 200Mbit ul m2 200Mbit tc qdisc add dev vlan1616 parent 1:100 handle 100: pfifo limit 128 ### End TM for client interface tc qdisc del dev vlan1616 ingress tc qdisc add dev vlan1616 handle : ingress tc filter add dev vlan1616 parent : protocol ip prio 50 u32 match ip src 0.0.0.0/0 police rate 200Mbit burst 200M mtu 32k drop flowid 1:1 And this is same for about 450 vlan interfaces Good thing is that compared to 4.14.3 i have about 5% less cpu load on 4.15.0-rc2+ When hfsc will be lockless or tbf - then it will be really huge difference in cpu load on x86 when using traffic shaping - so really good job John.
Re: Huge memory leak with 4.15.0-rc2+
W dniu 2017-12-11 o 22:23, Paweł Staszewski pisze: Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. Graph attached from memory usage: https://ibb.co/idK4zb HW config: Intel E5 8x Intel 82599 (used ixgbe driver from kernel) Interfaces with vlans attached All 8 ethernet ports are in one LAG group configured by team. With current settings (this host is acting as a router - and bgpd process is eating same amount of memory from the beginning about 5.2GB) cat /proc/meminfo MemTotal: 32770588 kB MemFree: 11342492 kB MemAvailable: 10982752 kB Buffers: 84704 kB Cached: 83180 kB SwapCached: 0 kB Active: 5105320 kB Inactive: 46252 kB Active(anon): 4985448 kB Inactive(anon): 1096 kB Active(file): 119872 kB Inactive(file): 45156 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 4005280 kB SwapFree: 4005280 kB Dirty: 236 kB Writeback: 0 kB AnonPages: 4983752 kB Mapped: 13556 kB Shmem: 2852 kB Slab: 1013124 kB SReclaimable: 45876 kB SUnreclaim: 967248 kB KernelStack: 7152 kB PageTables: 12164 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 20390572 kB Committed_AS: 396568 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB CmaTotal: 0 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 1407572 kB DirectMap2M: 20504576 kB DirectMap1G: 13631488 kB ps aux --sort -rss USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6758 1.8 14.9 5044996 4886964 ? Sl 01:22 23:21 /usr/local/sbin/bgpd -d -u root -g root -I --ignore_warnings root 6752 0.0 0.1 86272 61920 ? Ss 01:22 0:16 /usr/local/sbin/zebra -d -u root -g root -I --ignore_warnings root 6766 12.6 0.0 51592 29196 ? S 01:22 157:48 /usr/sbin/snmpd -p /var/run/snmpd.pid -Ln root 7494 0.0 0.0 708976 5896 ? Ssl 01:22 0:09 /opt/collectd/sbin/collectd root 15531 0.0 0.0 67864 5056 ? Ss 21:57 0:00 sshd: paol [priv] root 4915 0.0 0.0 271912 4904 ? Ss 01:21 0:25 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng/syslog-ng.persist --cfgfile /etc/syslog-ng/syslog-ng.conf --pidfile /run/syslog-ng.pid root 4278 0.0 0.0 37220 4164 ? Ss 01:21 0:00 /lib/systemd/systemd-udevd --daemon root 5147 0.0 0.0 32072 3232 ? Ss 01:21 0:00 /usr/sbin/sshd root 5203 0.0 0.0 28876 2436 ? S 01:21 0:00 teamd -d -f /etc/teamd.conf root 17372 0.0 0.0 17924 2388 pts/2 R+ 22:13 0:00 ps aux --sort -rss root 4789 0.0 0.0 5032 2176 ? Ss 01:21 0:00 mdadm --monitor --scan --daemonise --pid-file /var/run/mdadm.pid --syslog root 7511 0.0 0.0 12676 1920 tty4 Ss+ 01:22 0:00 /sbin/agetty 38400 tty4 linux root 7510 0.0 0.0 12676 1896 tty3 Ss+ 01:22 0:00 /sbin/agetty 38400 tty3 linux root 7512 0.0 0.0 12676 1860 tty5 Ss+ 01:22 0:00 /sbin/agetty 38400 tty5 linux root 7513 0.0 0.0 12676 1836 tty6 Ss+ 01:22 0:00 /sbin/agetty 38400 tty6 linux root 7509 0.0 0.0 12676 1832 tty2 Ss+ 01:22 0:00 /sbin/agetty 38400 tty2 linux And latest kernel that everything was working is: 4.14.3 Some observations - when i disable tso on all cards there is more memleak. When traffic starts to drop - there is less and less memleak below link to memory usage graph: https://ibb.co/hU97kG And there is rising slab_unrecl - Amount of unreclaimable memory used for slab kernel allocations Forgot to add that im using hfsc and qdiscs like pfifo on classes.
Huge memory leak with 4.15.0-rc2+
Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. Graph attached from memory usage: https://ibb.co/idK4zb HW config: Intel E5 8x Intel 82599 (used ixgbe driver from kernel) Interfaces with vlans attached All 8 ethernet ports are in one LAG group configured by team. With current settings (this host is acting as a router - and bgpd process is eating same amount of memory from the beginning about 5.2GB) cat /proc/meminfo MemTotal: 32770588 kB MemFree: 11342492 kB MemAvailable: 10982752 kB Buffers: 84704 kB Cached: 83180 kB SwapCached: 0 kB Active: 5105320 kB Inactive: 46252 kB Active(anon): 4985448 kB Inactive(anon): 1096 kB Active(file): 119872 kB Inactive(file): 45156 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 4005280 kB SwapFree: 4005280 kB Dirty: 236 kB Writeback: 0 kB AnonPages: 4983752 kB Mapped: 13556 kB Shmem: 2852 kB Slab: 1013124 kB SReclaimable: 45876 kB SUnreclaim: 967248 kB KernelStack: 7152 kB PageTables: 12164 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 20390572 kB Committed_AS: 396568 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB CmaTotal: 0 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 1407572 kB DirectMap2M: 20504576 kB DirectMap1G: 13631488 kB ps aux --sort -rss USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6758 1.8 14.9 5044996 4886964 ? Sl 01:22 23:21 /usr/local/sbin/bgpd -d -u root -g root -I --ignore_warnings root 6752 0.0 0.1 86272 61920 ? Ss 01:22 0:16 /usr/local/sbin/zebra -d -u root -g root -I --ignore_warnings root 6766 12.6 0.0 51592 29196 ? S 01:22 157:48 /usr/sbin/snmpd -p /var/run/snmpd.pid -Ln root 7494 0.0 0.0 708976 5896 ? Ssl 01:22 0:09 /opt/collectd/sbin/collectd root 15531 0.0 0.0 67864 5056 ? Ss 21:57 0:00 sshd: paol [priv] root 4915 0.0 0.0 271912 4904 ? Ss 01:21 0:25 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng/syslog-ng.persist --cfgfile /etc/syslog-ng/syslog-ng.conf --pidfile /run/syslog-ng.pid root 4278 0.0 0.0 37220 4164 ? Ss 01:21 0:00 /lib/systemd/systemd-udevd --daemon root 5147 0.0 0.0 32072 3232 ? Ss 01:21 0:00 /usr/sbin/sshd root 5203 0.0 0.0 28876 2436 ? S 01:21 0:00 teamd -d -f /etc/teamd.conf root 17372 0.0 0.0 17924 2388 pts/2 R+ 22:13 0:00 ps aux --sort -rss root 4789 0.0 0.0 5032 2176 ? Ss 01:21 0:00 mdadm --monitor --scan --daemonise --pid-file /var/run/mdadm.pid --syslog root 7511 0.0 0.0 12676 1920 tty4 Ss+ 01:22 0:00 /sbin/agetty 38400 tty4 linux root 7510 0.0 0.0 12676 1896 tty3 Ss+ 01:22 0:00 /sbin/agetty 38400 tty3 linux root 7512 0.0 0.0 12676 1860 tty5 Ss+ 01:22 0:00 /sbin/agetty 38400 tty5 linux root 7513 0.0 0.0 12676 1836 tty6 Ss+ 01:22 0:00 /sbin/agetty 38400 tty6 linux root 7509 0.0 0.0 12676 1832 tty2 Ss+ 01:22 0:00 /sbin/agetty 38400 tty2 linux And latest kernel that everything was working is: 4.14.3 Some observations - when i disable tso on all cards there is more memleak.
Re: [Intel-wired-lan] Instability of i40e driver on 4.9 kernel
The e1000 SourceForge tracker is a bad place to get anywhere with your problems. I just checked it again now to see if anything has changed :) But when I posted a reply to a bug where I was hitting the same problem, somebody closed the ticket and deleted my message :) So really :) On 2017-10-25 at 23:49, Pavlos Parissis wrote: On 21/10/2017 02:07 πμ, Fujinaka, Todd wrote: You picked a bunch of places to post this, and you really should've used a different place: e1000-de...@lists.sourceforge.net Just subscribed to that ML and mailed about it. Also, since you flagged the "communities" post as "answered", you're not likely to get any follow-up. The Intel communities are also not monitored as much by the wired networking people at Intel. I don't see it as "answered" when I visit the page, maybe the fact I have replied with extra information is confusing something. Anyway, it isn't that important since the "communities" posts aren't monitored by Intel people, which makes sense as it is very time consuming to monitor mailing lists and web forums at the same time. Please let us know if you have any specific issues, and please provide exact reproduction steps so we can investigate your issues, and please use e1000-devel. I hope the information I provided in my mail to e1000-devel is enough. Thanks, Pavlos
Re: intel i40e buggy driver question
W dniu 2017-10-28 o 00:34, Paweł Staszewski pisze: Hi I have many problems with 40e driver memleaks , kernel panics , stack traces , tx hungx , tx timeouts and many many others :) But the main problem that can't be resolved in linux is resolved in freebsd problem in freebsd with this: [2501243.181829] i40e :01:00.1 eno2: VSI_seid 390, Hung TX queue 17, tx_pending_hw: 1, NTC:0x16b, HWB: 0x16b, NTU: 0x16c, TAIL: 0x16c [2501243.181835] i40e :01:00.1 eno2: VSI_seid 390, Issuing force_wb for TX queue 17, Interrupt Reg: 0x0 Was solved by this: " change this piece in ixl_tso_detect_sparse() in ixl_txrx.c: if (mss < 1) { if (num > IXL_SPARSE_CHAIN) return (true); num = (mss == 0) ? 0 : 1; mss += mp->m_pkthdr.tso_segsz; } to if (num > IXL_SPARSE_CHAIN) return (true); if (mss < 1) { num = (mss == 0) ? 0 : 1; mss += mp->m_pkthdr.tso_segsz; } Intel FreeBSD Team: This will definitely prevent MDDs on the buffers you sent me. " An I have a question - how to do the same in linux ? :) Cause i have same problem in Linux with this i40e buggy driver: [224051.287277] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9 [224051.287278] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si [224051.287327] CPU: 3 PID: 25031 Comm: ip Tainted: G W 4.12.14 #2 [224051.287330] task: 880859e09880 task.stack: c900036ec000 [224051.287332] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9 [224051.287332] RSP: 0018:c900036ef6e8 EFLAGS: 00010286 [224051.287333] RAX: 8808595eda00 RBX: 880856d36d00 RCX: 014000c1 [224051.287334] RDX: 0001 RSI: 880844418000 RDI: 880856d36d00 [224051.287334] RBP: c900036ef6f8 R08: 0001ccc3 R09: ea0021110620 [224051.287335] R10: R11: 88087effae90 R12: 8808590300a0 [224051.287335] R13: 0002 R14: fff0 R15: 0001 [224051.287336] FS: 7f1e4658b740() GS:88085e2c() knlGS: [224051.287337] CS: 0010 DS: ES: CR0: 80050033 [224051.287337] CR2: 7ffd7479 CR3: 00059e2f4000 CR4: 001406e0 [224051.287338] Call Trace: [224051.287339] i40e_vsi_open+0x7d/0x1e7 [224051.287341] i40e_open+0x4d/0xc3 [224051.287342] __dev_open+0x8b/0xcd [224051.287344] __dev_change_flags+0xa2/0x13d [224051.287346] dev_change_flags+0x20/0x53 [224051.287347] do_setlink+0x2d0/0xad6 [224051.287349] ? zone_statistics+0x5a/0x61 [224051.287350] ? get_page_from_freelist+0x4c8/0x627 [224051.287352] rtnl_newlink+0x391/0x6d6 [224051.287353] ? netdev_master_upper_dev_get+0xd/0x57 [224051.287354] ? rtnl_newlink+0x106/0x6d6 [224051.287356] ? alloc_pages_vma+0x8c/0x17a [224051.287357] ? pagevec_lru_move_fn+0x20/0xc1 [224051.287359] ? lru_cache_add_active_or_unevictable+0x27/0x7a [224051.287360] ? __handle_mm_fault+0x4c1/0x8ae [224051.287362] rtnetlink_rcv_msg+0x166/0x173 [224051.287363] ? __kmalloc_node_track_caller+0x11f/0x12f [224051.287365] ? __alloc_skb+0x89/0x175 [224051.287366] ? rtnl_newlink+0x6d6/0x6d6 [224051.287367] netlink_rcv_skb+0x57/0xa0 [224051.287369] rtnetlink_rcv+0x1e/0x25 [224051.287371] netlink_unicast+0x103/0x187 [224051.287372] netlink_sendmsg+0x28d/0x2ad [224051.287374] sock_sendmsg_nosec+0x12/0x1d [224051.287375] ___sys_sendmsg+0x19d/0x217 [224051.287377] ? kmem_cache_free+0x4b/0xf3 [224051.287492] ? alloc_pages_vma+0x147/0x17a [224051.287494] ? __page_set_anon_rmap+0x24/0x65 [224051.287495] ? get_page+0x9/0xf [224051.287496] ? __lru_cache_add+0x18/0x47 [224051.287498] ? __handle_mm_fault+0x4c1/0x8ae [224051.287499] __sys_sendmsg+0x40/0x5e [224051.287564] ? 
__sys_sendmsg+0x40/0x5e [224051.287566] SyS_sendmsg+0xd/0x17 [224051.287567] entry_SYSCALL_64_fastpath+0x13/0x94 [224051.287568] RIP: 0033:0x7f1e45cac620 [224051.287569] RSP: 002b:7ffd7478b4d8 EFLAGS: 0246 ORIG_RAX: 002e [224051.287570] RAX: ffda RBX: RCX: 7f1e45cac620 [224051.287571] RDX: RSI: 7ffd7478b520 RDI: 0003 [224051.287572] RBP: 7ffd7478b520 R08: 0001 R09: fefefeff77686d74 [224051.287572] R10: 05e6 R11: 0246 R12: 7ffd7478b560 [224051.287573] R13: 006724c0 R14: 7ffd747935e0 R15: [224051.287574] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43 [224051.287597] ---[ end trace a9810da52af61a5a ]--- [224051.287607] [ cut here ] [224051.287609] WAR
intel i40e buggy driver question
Hi I have many problems with 40e driver memleaks , kernel panics , stack traces , tx hungx , tx timeouts and many many others :) But the main problem that can't be resolved in linux is resolved in freebsd problem in freebsd with this: [2501243.181829] i40e :01:00.1 eno2: VSI_seid 390, Hung TX queue 17, tx_pending_hw: 1, NTC:0x16b, HWB: 0x16b, NTU: 0x16c, TAIL: 0x16c [2501243.181835] i40e :01:00.1 eno2: VSI_seid 390, Issuing force_wb for TX queue 17, Interrupt Reg: 0x0 Was solved by this: " change this piece in ixl_tso_detect_sparse() in ixl_txrx.c: if (mss < 1) { if (num > IXL_SPARSE_CHAIN) return (true); num = (mss == 0) ? 0 : 1; mss += mp->m_pkthdr.tso_segsz; } to if (num > IXL_SPARSE_CHAIN) return (true); if (mss < 1) { num = (mss == 0) ? 0 : 1; mss += mp->m_pkthdr.tso_segsz; } Intel FreeBSD Team: This will definitely prevent MDDs on the buffers you sent me. " An I have a question - how to do the same in linux ? :) Cause i have same problem in Linux with this i40e buggy driver: [224051.287277] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9 [224051.287278] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si [224051.287327] CPU: 3 PID: 25031 Comm: ip Tainted: G W 4.12.14 #2 [224051.287330] task: 880859e09880 task.stack: c900036ec000 [224051.287332] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9 [224051.287332] RSP: 0018:c900036ef6e8 EFLAGS: 00010286 [224051.287333] RAX: 8808595eda00 RBX: 880856d36d00 RCX: 014000c1 [224051.287334] RDX: 0001 RSI: 880844418000 RDI: 880856d36d00 [224051.287334] RBP: c900036ef6f8 R08: 0001ccc3 R09: ea0021110620 [224051.287335] R10: R11: 88087effae90 R12: 8808590300a0 [224051.287335] R13: 0002 R14: fff0 R15: 0001 [224051.287336] FS: 7f1e4658b740() GS:88085e2c() knlGS: [224051.287337] CS: 0010 DS: ES: CR0: 80050033 [224051.287337] CR2: 7ffd7479 CR3: 00059e2f4000 CR4: 001406e0 [224051.287338] Call Trace: [224051.287339] i40e_vsi_open+0x7d/0x1e7 [224051.287341] i40e_open+0x4d/0xc3 [224051.287342] __dev_open+0x8b/0xcd [224051.287344] __dev_change_flags+0xa2/0x13d [224051.287346] dev_change_flags+0x20/0x53 [224051.287347] do_setlink+0x2d0/0xad6 [224051.287349] ? zone_statistics+0x5a/0x61 [224051.287350] ? get_page_from_freelist+0x4c8/0x627 [224051.287352] rtnl_newlink+0x391/0x6d6 [224051.287353] ? netdev_master_upper_dev_get+0xd/0x57 [224051.287354] ? rtnl_newlink+0x106/0x6d6 [224051.287356] ? alloc_pages_vma+0x8c/0x17a [224051.287357] ? pagevec_lru_move_fn+0x20/0xc1 [224051.287359] ? lru_cache_add_active_or_unevictable+0x27/0x7a [224051.287360] ? __handle_mm_fault+0x4c1/0x8ae [224051.287362] rtnetlink_rcv_msg+0x166/0x173 [224051.287363] ? __kmalloc_node_track_caller+0x11f/0x12f [224051.287365] ? __alloc_skb+0x89/0x175 [224051.287366] ? rtnl_newlink+0x6d6/0x6d6 [224051.287367] netlink_rcv_skb+0x57/0xa0 [224051.287369] rtnetlink_rcv+0x1e/0x25 [224051.287371] netlink_unicast+0x103/0x187 [224051.287372] netlink_sendmsg+0x28d/0x2ad [224051.287374] sock_sendmsg_nosec+0x12/0x1d [224051.287375] ___sys_sendmsg+0x19d/0x217 [224051.287377] ? kmem_cache_free+0x4b/0xf3 [224051.287492] ? alloc_pages_vma+0x147/0x17a [224051.287494] ? __page_set_anon_rmap+0x24/0x65 [224051.287495] ? get_page+0x9/0xf [224051.287496] ? __lru_cache_add+0x18/0x47 [224051.287498] ? __handle_mm_fault+0x4c1/0x8ae [224051.287499] __sys_sendmsg+0x40/0x5e [224051.287564] ? 
__sys_sendmsg+0x40/0x5e [224051.287566] SyS_sendmsg+0xd/0x17 [224051.287567] entry_SYSCALL_64_fastpath+0x13/0x94 [224051.287568] RIP: 0033:0x7f1e45cac620 [224051.287569] RSP: 002b:7ffd7478b4d8 EFLAGS: 0246 ORIG_RAX: 002e [224051.287570] RAX: ffda RBX: RCX: 7f1e45cac620 [224051.287571] RDX: RSI: 7ffd7478b520 RDI: 0003 [224051.287572] RBP: 7ffd7478b520 R08: 0001 R09: fefefeff77686d74 [224051.287572] R10: 05e6 R11: 0246 R12: 7ffd7478b560 [224051.287573] R13: 006724c0 R14: 7ffd747935e0 R15: [224051.287574] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43 [224051.287597] ---[ end trace a9810da52af61a5a ]--- [224051.287607] [ cut here ] [224051.287609] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx
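Until the descriptor-handling issue is fixed in the Linux i40e driver itself, a hedged interim mitigation, in line with the offload toggling already used elsewhere in these reports, is to keep TSO (and optionally GSO/GRO) disabled on the affected ports so the driver does not have to handle sparse TSO chains at all. This trades CPU for stability and is only a sketch; the interface names are placeholders:

# Sketch only: disable segmentation offloads on the i40e ports as a workaround
# for the hung-TX / MDD symptoms described above.
for dev in eno1 eno2; do           # placeholder i40e interface names
    ethtool -K "$dev" tso off gso off gro off
    ethtool -k "$dev" | egrep 'tcp-segmentation-offload|generic-(segmentation|receive)-offload'
done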
Re: [Intel-wired-lan] [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count
As of today it is in net.git: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/log/?qt=grep&q=i40e It will reach net-next later. Also, can you please tell me what firmware you are using with your NICs? Are those X710? Thanks Paweł On 2017-10-27 at 23:20, Pavlos Parissis wrote: On 23 October 2017 at 01:15, Paweł Staszewski wrote: Yes, I can confirm that after adding the patch [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count there is no memleak. Somehow this patch isn't present in the current net-next repo. Shouldn't it be there? Cheers, Pavlos
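To answer the firmware question in a reproducible way, ethtool reports the driver and NVM/firmware version per port; a small sketch (interface names are placeholders):

# Sketch only: dump driver, version and firmware for each i40e port.
for dev in enp2s0f0 enp2s0f1 enp3s0f0 enp3s0f1; do
    echo "== $dev =="
    ethtool -i "$dev"      # prints driver, version, firmware-version, bus-info
done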
Re: [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count
Yes can confirm that after adding patch: [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count There is no memleak. W dniu 2017-10-22 o 20:01, Anders K. Pedersen | Cohaesio pisze: On lør, 2017-10-21 at 18:12 -0700, Alexander Duyck wrote: From: Alexander Duyck This patch updates the i40e driver to include programming descriptors in the cleaned_count. Without this change it becomes possible for us to leak memory as we don't trigger a large enough allocation when the time comes to allocate new buffers and we end up overwriting a number of rx_buffers equal to the number of programming descriptors we encountered. Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status") Signed-off-by: Alexander Duyck This patch solves the remaining memory leak we've seen, so Tested-by: Anders K. Pedersen Regards, Anders
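A quick way to check whether the cleaned_count fix has already reached the tree you are building from; the commit subject is taken from this thread, and the grep-based check is just a sketch:

# Sketch only: verify the fix is present in the currently checked-out kernel tree.
git log --oneline --grep='i40e: Add programming descriptors to cleaned_count' \
    drivers/net/ethernet/intel/i40e/

# Or, once the commit id is known from net.git:
# git merge-base --is-ancestor <commit-id> HEAD && echo "fix is included"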
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 01:56, Paweł Staszewski pisze: W dniu 2017-10-19 o 01:51, Paweł Staszewski pisze: W dniu 2017-10-19 o 01:37, Alexander Duyck pisze: On Wed, Oct 18, 2017 at 4:22 PM, Paweł Staszewski wrote: W dniu 2017-10-19 o 00:58, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 e
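Alexander's bisect request is more tractable than it might look: git bisect needs roughly log2 of the number of commits in the window, typically around 13 to 15 build-and-test cycles for one release window. A sketch of the workflow on a test router, using v4.11 as the known-good baseline (4.11.12 was reported leak-free) and v4.12 as the first bad release; the traffic test itself is up to the operator:

# Sketch only: bisect the i40e memory leak between the last good and first bad kernels.
git bisect start
git bisect bad  v4.12           # first mainline release where the leak was seen
git bisect good v4.11           # known-good baseline from this thread
# For each step: build, boot, run traffic for ~1 hour, watch /proc/meminfo, then
#   git bisect good     # if memory stays flat
#   git bisect bad      # if it keeps growing
# Repeat until git prints the first bad commit, then save the record:
git bisect log > i40e-memleak-bisect.log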
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 01:51, Paweł Staszewski pisze: W dniu 2017-10-19 o 01:37, Alexander Duyck pisze: On Wed, Oct 18, 2017 at 4:22 PM, Paweł Staszewski wrote: W dniu 2017-10-19 o 00:58, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 01:37, Alexander Duyck pisze: On Wed, Oct 18, 2017 at 4:22 PM, Paweł Staszewski wrote: W dniu 2017-10-19 o 00:58, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 12
Re: Linux 4.12+ memory leak on router with i40e NICs
On 2017-10-19 at 01:29, Alexander Duyck wrote: On Mon, Oct 16, 2017 at 10:51 PM, Vitezslav Samel wrote: On Tue, Oct 17, 2017 at 01:34:29AM +0200, Paweł Staszewski wrote: On 2017-10-16 at 18:26, Paweł Staszewski wrote: On 2017-10-16 at 13:20, Pavlos Parissis wrote: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify, is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree, how long ago did you pull it, since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests tonight with the "net" git tree where this patch is included, starting from 0:00 CET :) Upgraded, and it looks like the problem is not solved with that patch. Currently running the system with the https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel, still about 0.5GB of memory is leaking somewhere. I can also confirm that the latest kernel where memory is not leaking (with the i40e driver and Intel 710 cards) is 4.11.12. With kernel 4.11.12 - after an hour, no change in memory usage. I also checked ixgbe instead of i40e with the same net.git kernel and there is no memleak - after an hour, the same memory usage - so this is 100% an i40e driver problem. I have (probably) the same problem here but with X520 cards: booting 4.12.x gives me an oops after circa 20 minutes of our workload. Booting 4.9.y is OK. This machine is in production so any testing is very limited. The machine was stable for >2 months (on the desk, before it got to production) with 4.12.8, but with no traffic on the X520 cards. Cheers, Vita Sorry, but it can't be the same issue since we are discussing a different driver (i40e) running different hardware (X710 or XL170). You might want to start a new thread for your issue, and/or if possible file a bug on e1000.sf.net. Thanks. - Alex Sorry, but bugs reported on e1000.sf.net are handled with long delays - some only after 6 or more months. When I reported my first bug there, I got a reply after a year, about no activity :):) haha - and the bug reported there is still active :) It is better for me now to change the NICs (certainly cheaper from the clients' perspective :) ) to Mellanox, or just to replace them and use ixgbe, which does not have this bug (Mellanox and ixgbe show no such bug - I have many servers with them with the same configuration, and only the one with i40e, with the same configuration, has the memleak). If nobody from Intel wants to reproduce this - cool - then it is not my problem but Intel's :) - there are many good NICs to use now, like Mellanox, or one can just stick with many 10G ports based on ixgbe, which is a really good driver. But really? The Intel guys have no XL710 cards? I don't want to buy more buggy cards just to do kernel bisects, sorry. To bisect this bug properly you would need maybe 200/300 bisect steps, and maybe 30 minutes to confirm each one - so count how much time that is; more than 100 Mellanox cards in price, maybe :) so imagine what I will do :) Thanks Paweł
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 00:58, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same l
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change fro
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-10-18 at 23:54, Eric Dumazet wrote:
On Wed, 2017-10-18 at 23:49 +0200, Paweł Staszewski wrote:
How far is it from being applied to the kernel? So far I'm using this on all my servers, for about 3 months now, without problems.
It is a hack, and does not properly support bonding/team. (If the real_dev->priv_flags IFF_XMIT_DST_RELEASE bit changes, we want to update all the vlans at the same time.)
We need something more sophisticated, and I have had no time to spend on this topic recently.
OK
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 e
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
W dniu 2017-09-21 o 23:41, Florian Fainelli pisze: On 09/21/2017 02:26 PM, Paweł Staszewski wrote: W dniu 2017-08-15 o 11:11, Paweł Staszewski pisze: diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644 --- a/net/8021q/vlan_netlink.c +++ b/net/8021q/vlan_netlink.c @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev, vlan->vlan_proto = proto; vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]); vlan->real_dev = real_dev; +dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE); vlan->flags = VLAN_FLAG_REORDER_HDR; err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id); Any plans for this patch to go normal into the kernel ? Would not this apply to pretty much any stacked device setup though? It seems like any network device that just queues up its packet on another physical device for actual transmission may need that (e.g: DSA, bond, team, more.?) How far it is from applying this to the kernel ? So far im using this on all my servers from about 3 months now without problems
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off a
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 1024 tx-usecs 128
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 1024 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 1024 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done MEMLEAK: 4 MB/10sec 3 MB/10sec 4 MB/
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 1024 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done MEMLEAK: 4 MB/10sec 3 MB/10sec 4 MB/10sec 4 MB/10sec So memleak have something to do with
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. - Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. - Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. - Alex
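The bisection Alex asks for here is the standard kernel workflow between the last known-good and first known-bad releases. A minimal sketch against the mainline tags discussed in the thread (v4.11 good, v4.12 bad); kernel config, install and boot steps depend on the box and are omitted:

git bisect start
git bisect bad v4.12
git bisect good v4.11
make olddefconfig && make -j"$(nproc)" bzImage modules
# install the test kernel, reboot, push production-like traffic for ~1 hour
# while watching memory usage, then mark the result:
git bisect good      # or: git bisect bad
# repeat until git prints the first bad commit, then clean up with: git bisect reset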
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem.
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :)
Re: Linux 4.12+ memory leak on router with i40e NICs
16:53 0:00 [kworker/4:3]
root 20108 0.0 0.0 0 0 ? I 16:54 0:00 [kworker/3:2]
root 20109 0.0 0.0 0 0 ? I 16:54 0:00 [kworker/3:3]
root 20110 0.0 0.0 0 0 ? I 16:54 0:00 [kworker/0:6]
root 20217 0.0 0.0 0 0 ? I 16:55 0:00 [kworker/1:0]
root 20219 0.0 0.0 0 0 ? I 16:56 0:00 [kworker/9:1]
root 20222 0.0 0.0 0 0 ? I 16:56 0:00 [kworker/9:3]
root 20354 0.0 0.0 0 0 ? I 16:57 0:00 [kworker/5:0]
root 20355 0.0 0.0 0 0 ? I 16:57 0:00 [kworker/5:3]
root 20814 0.0 0.0 0 0 ? I 16:57 0:00 [kworker/u24:2]
root 26845 0.0 0.0 0 0 ? I 15:40 0:00 [kworker/7:2]
root 26979 0.0 0.0 0 0 ? I 15:43 0:00 [kworker/0:3]
root 27375 0.0 0.0 0 0 ? I 15:48 0:00 [kworker/0:2]

but free -m:

              total        used        free      shared  buff/cache   available
Mem:          32113       18345       13598           0         169       13419
Swap:          3911           0        3911

shows less and less free - about 0.5 MB per hour.

It looks like this commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 is not included in: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

OK, I will upgrade tomorrow and check with that fix.

On 2017-10-15 at 02:58, Alexander Duyck wrote:
Hi Pawel,
To clarify, is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree, how long ago did you pull it, since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972
Thanks.
- Alex

On Sat, Oct 14, 2017 at 3:03 PM, Paweł Staszewski wrote:
Forgot to add - these graphs were tested with kernel 4.14-rc4-next.

On 2017-10-15 at 00:00, Paweł Staszewski pisze:
Same problem here. The only difference is changing Intel 82599 to X710, and there is a memleak.
Memory with the ixgbe driver over time - same config, same kernel.
Changed the NICs to X710 (i40e driver) - this is the only change - and memory over time:
There is no process that is eating the memory - it looks like there is some problem with the i40e driver, but that is not a surprise :) This driver is really buggy in many ways - most tickets I opened on the e1000e SourceForge tracker have had no reply for a year or more, or if somebody replies after a year, they close the ticket after one day with a note about no activity :)

On 2017-10-05 at 07:19, Anders K. Pedersen | Cohaesio wrote:
On Wed, 2017-10-04 at 08:32 -0700, Alexander Duyck wrote:
On Wed, Oct 4, 2017 at 5:56 AM, Anders K. Pedersen | Cohaesio wrote:
Hello,
After updating one of our Linux based routers to kernel 4.13 it began leaking memory quite fast (about 1 GB every half hour). To narrow it down we tried various kernel versions and found that 4.11.12 is okay, while 4.12 also leaks, so we did a bisection between 4.11 and 4.12. The first bisection ended at "[6964e53f55837b0c49ed60d36656d2e0ee4fc27b] i40e: fix handling of HW ATR eviction", which fixes some flag handling that was broken by 47994c119a36 "i40e: remove hw_disabled_flags in favor of using separate flag bits", so I did a second bisection, where I added 6964e53f5583 "i40e: fix handling of HW ATR eviction" to the steps that had 47994c119a36 "i40e: remove hw_disabled_flags in favor of using separate flag bits" in them. The second bisection ended at "[0e626ff7ccbfc43c6cc4aeea611c40b899682382] i40e: Fix support for flow director programming status", where I don't see any obvious problems, so I'm hoping for some assistance.
The router is a PowerEdge R730 server (Haswell based) with three Intel NICs (all using the i40e driver):
X710 quad port 10 GbE SFP+: eth0 eth1 eth2 eth3
X710 quad port 10 GbE SFP+: eth4 eth5 eth6 eth7
XL710 dual port 40 GbE QSFP+: eth8 eth9

The NICs are aggregated with LACP with the team driver:
team0: eth9 (40 GbE selected primary), and eth3, eth7 (10 GbE non-selected backups)
team1: eth0, eth1, eth4, eth5 (all 10 GbE selected)

team0 is used for internal networks and has one untagged and four tagged VLAN interfaces, while team1 has an external uplink connection without any VLANs. The router runs an eBGP session on team1 to one of our uplinks, and iBGP via team0 to our other border routers. It also runs OSPF on the internal VLANs on team0. One thing I've noticed is that when OSPF is not announcing a default gateway to the internal networks, so there is almost no traffic coming in on team0 and out on team1, but still
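Earlier in this message Pawel wonders whether the i40e fix commit is already included in linux-next. One quick way to check that from a local kernel clone; the sha is taken from the commit URL quoted above, while the remote names are just placeholders:

git remote add net https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
git remote add linux-next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
git fetch net && git fetch linux-next
git branch -r --contains 2b9478ffc550    # lists the remote branches that already carry the fix

If linux-next/master does not show up in the output, the fix has not been merged there yet.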
Re: Latest kernel net-next - 4.14-rc1+ / WARNING: CPU: 16 PID: 0 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269
A little more in trace: [49519.600903] [ cut here ] [49519.600908] WARNING: CPU: 7 PID: 31426 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269 [49519.600909] Modules linked in: ipmi_si x86_pkg_temp_thermal [49519.600914] CPU: 7 PID: 31426 Comm: syslog-ng Tainted: G W 4.14.0-rc1+ #10 [49519.600915] task: 88086d07c100 task.stack: c90006d54000 [49519.600917] RIP: 0010:hfsc_dequeue+0x241/0x269 [49519.600918] RSP: 0018:88087fa439f8 EFLAGS: 00010246 [49519.600919] RAX: RBX: 88085b8af148 RCX: 0018 [49519.600920] RDX: RSI: RDI: 88085b8af440 [49519.600922] RBP: 88087fa43a20 R08: 1600 R09: [49519.600923] R10: 88087fa43960 R11: 880859d50a00 R12: 88085b8af000 [49519.600924] R13: 00b4266fab9b R14: 0001 R15: 88085b8af440 [49519.600925] FS: 7fad63a35700() GS:88087fa4() knlGS: [49519.600926] CS: 0010 DS: ES: CR0: 80050033 [49519.600927] CR2: 7f10d4a90098 CR3: 00046bb1c005 CR4: 001606e0 [49519.600928] Call Trace: [49519.600929] [49519.600932] __qdisc_run+0xed/0x293 [49519.600935] __dev_queue_xmit+0x2d2/0x4b3 [49519.600936] ? eth_header+0x27/0xab [49519.600938] dev_queue_xmit+0xb/0xd [49519.600939] ? dev_queue_xmit+0xb/0xd [49519.600943] neigh_connected_output+0x9b/0xb2 [49519.600948] ip_finish_output2+0x24b/0x28f [49519.600952] ? statistic_mt+0x30/0x72 [49519.600954] ip_finish_output+0x101/0x10d [49519.600957] ip_output+0x56/0xa9 [49519.600959] ip_forward_finish+0x53/0x58 [49519.600961] ip_forward+0x2b2/0x308 [49519.600962] ? ip_frag_mem+0xf/0xf [49519.600964] ip_rcv_finish+0x27c/0x287 [49519.600965] ip_rcv+0x2b0/0x300 [49519.600968] ? vlan_do_receive+0x49/0x294 [49519.600970] __netif_receive_skb_core+0x312/0x496 [49519.600972] ? tk_clock_read+0xc/0xe [49519.600973] __netif_receive_skb+0x18/0x57 [49519.600974] ? __netif_receive_skb+0x18/0x57 [49519.600975] netif_receive_skb_internal+0x4b/0xa1 [49519.600977] napi_gro_complete+0x7a/0x7d [49519.600977] napi_gro_flush+0x3b/0x66 [49519.600979] napi_complete_done+0x4b/0xa8 [49519.600983] ixgbe_poll+0x90c/0xeaa [49519.600985] net_rx_action+0xd3/0x22d [49519.600988] __do_softirq+0xe4/0x23a [49519.600991] irq_exit+0x4d/0x5b [49519.600992] do_IRQ+0x96/0xae [49519.600996] common_interrupt+0x90/0x90 [49519.600997] [49519.601001] RIP: 0010:do_con_write+0x2d0/0x1b13 [49519.601002] RSP: 0018:c90006d57bc0 EFLAGS: 0282 ORIG_RAX: ffa2 [49519.601004] RAX: 004b RBX: 004b RCX: fffd [49519.601005] RDX: 004d RSI: 0003 RDI: 81e751b8 [49519.601006] RBP: c90006d57c68 R08: R09: [49519.601007] R10: 88086b64f807 R11: 88086d07c100 R12: fffd [49519.601008] R13: 88046c7bd400 R14: 88086b64f800 R15: 004b [49519.601011] ? _raw_spin_lock+0x9/0xb [49519.601013] con_write+0xe/0x20 [49519.601018] n_tty_write+0x101/0x3f5 [49519.601021] ? init_wait_entry+0x29/0x29 [49519.601024] tty_write+0x1a9/0x228 [49519.601026] ? n_tty_flush_buffer+0x4c/0x4c [49519.601029] do_loop_readv_writev+0x6f/0xa1 [49519.601031] do_iter_write+0x8e/0xb8 [49519.601032] vfs_writev+0x77/0xad [49519.601034] ? __vfs_write+0x21/0xa0 [49519.601037] ? __fget+0x25/0x56 [49519.601038] ? __fget_light+0x3b/0x46 [49519.601039] ? __fdget+0xe/0x10 [49519.601040] do_writev+0x4f/0xa1 [49519.601041] ? 
do_writev+0x4f/0xa1 [49519.601043] SyS_writev+0xb/0xd [49519.601044] entry_SYSCALL_64_fastpath+0x13/0x94 [49519.601046] RIP: 0033:0x7fad66233da9 [49519.601046] RSP: 002b:7fad63a32ac0 EFLAGS: 0293 ORIG_RAX: 0014 [49519.601047] RAX: ffda RBX: 01eddd08 RCX: 7fad66233da9 [49519.601048] RDX: 0001 RSI: 01edeab0 RDI: 0011 [49519.601049] RBP: 01eddd08 R08: R09: 7fad662842d0 [49519.601049] R10: R11: 0293 R12: 7fad63a326f0 [49519.601050] R13: 7fad4c0045c0 R14: 7fad4c0045c0 R15: 01eddcb0 [49519.601051] Code: f6 48 3d 90 00 00 00 74 04 48 8b 70 70 49 8b 84 24 68 02 00 00 48 85 c0 74 0c 48 39 f0 72 24 48 85 f6 75 09 eb 1d 48 85 f6 75 02 <0f> ff 49 8d bc 24 48 04 00 00 48 c1 e6 06 e8 a9 62 ff ff e9 eb [49519.601065] ---[ end trace 8558fb6f1ca3beb0 ]--- W dniu 2017-09-26 o 14:00, Paweł Staszewski pisze: [50102.787542] [ cut here ] [50102.787545] WARNING: CPU: 16 PID: 0 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269 [50102.787545] Modules linked in: ipmi_si x86_pkg_temp_thermal [50102.787547] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W 4.14.0-rc1+ #10 [50102.787548] task: 88046d44 task.stack: c900032e [50102.787549] RIP: 0010:hfsc_dequeue+0x241/0x269 [50102.78755
Latest kernel net-next - 4.14-rc1+ / WARNING: CPU: 16 PID: 0 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269
[50102.787542] [ cut here ] [50102.787545] WARNING: CPU: 16 PID: 0 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269 [50102.787545] Modules linked in: ipmi_si x86_pkg_temp_thermal [50102.787547] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W 4.14.0-rc1+ #10 [50102.787548] task: 88046d44 task.stack: c900032e [50102.787549] RIP: 0010:hfsc_dequeue+0x241/0x269 [50102.787550] RSP: 0018:88046fc83eb0 EFLAGS: 00010246 [50102.787551] RAX: RBX: 880456309948 RCX: 0018 [50102.787551] RDX: RSI: RDI: 880456309c40 [50102.787552] RBP: 88046fc83ed8 R08: 0001c000 R09: 0100 [50102.787553] R10: 88046fc83e98 R11: 0003 R12: 880456309800 [50102.787553] R13: 00b6459156dd R14: 0001 R15: 880456309c40 [50102.787554] FS: () GS:88046fc8() knlGS: [50102.787555] CS: 0010 DS: ES: CR0: 80050033 [50102.787556] CR2: 7fc764f21090 CR3: 00085844a000 CR4: 001606e0 [50102.787556] Call Trace: [50102.787557] [50102.787558] __qdisc_run+0xed/0x293 [50102.787560] net_tx_action+0xeb/0x18b [50102.787562] __do_softirq+0xe4/0x23a [50102.787564] irq_exit+0x4d/0x5b [50102.787565] smp_apic_timer_interrupt+0xc0/0xfa [50102.787566] apic_timer_interrupt+0x90/0xa0 [50102.787566] [50102.787568] RIP: 0010:cpuidle_enter_state+0x134/0x189 [50102.787569] RSP: 0018:c900032e3ea0 EFLAGS: 0246 ORIG_RAX: ff10 [50102.787570] RAX: 2d9176d7d9f4 RBX: 0002 RCX: 001f [50102.787570] RDX: RSI: 0010 RDI: [50102.787571] RBP: c900032e3ed0 R08: ffd8 R09: 0003 [50102.787572] R10: c900032e3e70 R11: 88046fc98e50 R12: 88046c234400 [50102.787572] R13: 2d9176d7d9f4 R14: 0002 R15: 2d9176d6e845 [50102.787575] cpuidle_enter+0x12/0x14 [50102.787576] do_idle+0x113/0x16b [50102.787578] cpu_startup_entry+0x1a/0x1f [50102.787580] start_secondary+0xea/0xed [50102.787581] secondary_startup_64+0xa5/0xa5 [50102.787582] Code: f6 48 3d 90 00 00 00 74 04 48 8b 70 70 49 8b 84 24 68 02 00 00 48 85 c0 74 0c 48 39 f0 72 24 48 85 f6 75 09 eb 1d 48 85 f6 75 02 <0f> ff 49 8d bc 24 48 04 00 00 48 c1 e6 06 e8 a9 62 ff ff e9 eb [50102.787602] ---[ end trace 8558fb6f1ca3beb2 ]---
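The warning fires inside hfsc_dequeue(), so an HFSC root qdisc sits on the transmit path of the affected interface. The actual tc configuration is not shown anywhere in this thread; the following is only a generic illustration of the kind of setup that exercises this code path, with the interface name and rates as placeholders:

DEV=eth0
tc qdisc add dev $DEV root handle 1: hfsc default 10
tc class add dev $DEV parent 1: classid 1:10 hfsc sc rate 500mbit ul rate 1gbit
tc qdisc add dev $DEV parent 1:10 handle 10: sfq perturb 10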
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-09-21 at 23:41, Florian Fainelli wrote: On 09/21/2017 02:26 PM, Paweł Staszewski wrote: On 2017-08-15 at 11:11, Paweł Staszewski wrote: diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644 --- a/net/8021q/vlan_netlink.c +++ b/net/8021q/vlan_netlink.c @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev, vlan->vlan_proto = proto; vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]); vlan->real_dev = real_dev; +dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE); vlan->flags = VLAN_FLAG_REORDER_HDR; err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id); Any plans for this patch to go into the kernel properly? Wouldn't this apply to pretty much any stacked device setup, though? It seems like any network device that just queues up its packets on another physical device for actual transmission may need that (e.g. DSA, bond, team, more?). Some devices like bond have it. Maybe vlans were just not taken into account when the first patch was done. I did not check them all :) But I know Eric will :)
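For context, the reason propagating IFF_XMIT_DST_RELEASE to the vlan device helps is visible in the transmit path. The sketch below paraphrases the dst handling in __dev_queue_xmit() (net/core/dev.c) of that era; the function name xmit_dst_release_sketch and the exact wording are illustrative, not the kernel's actual code. When the flag is set, the skb's dst is dropped while it is still hot in the CPU cache; otherwise every packet sent through the device takes an extra atomic reference on the dst, which is the per-packet cost the one-line vlan change above avoids whenever the real device does not need the dst either.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/dst.h>

/* Paraphrase of the dst handling in __dev_queue_xmit(); a sketch, not the
 * actual kernel function.
 */
static void xmit_dst_release_sketch(struct net_device *dev, struct sk_buff *skb)
{
	/* If the device/qdisc does not need skb->dst, release it right now
	 * while it is still hot in this CPU's cache.
	 */
	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
		skb_dst_drop(skb);
	else
		skb_dst_force(skb);	/* one atomic op on dst->__refcnt per packet */
}

This also answers the stacked-device question in spirit: any virtual device that only hands packets to a lower device can keep the flag set when all of its lower devices have it, which is presumably what bonding already does and what the vlan patch adds.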
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-09-21 at 23:34, Eric Dumazet wrote: On Thu, 2017-09-21 at 23:26 +0200, Paweł Staszewski wrote: On 2017-08-15 at 11:11, Paweł Staszewski wrote: diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644 --- a/net/8021q/vlan_netlink.c +++ b/net/8021q/vlan_netlink.c @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev, vlan->vlan_proto = proto; vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]); vlan->real_dev = real_dev; +dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE); vlan->flags = VLAN_FLAG_REORDER_HDR; err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id); Any plans for this patch to go into the kernel properly? So far I have been using it for about 3 weeks on all my Linux based routers - and no problems. Yes, I was about to submit it, as I mentioned to you a few hours ago ;) Yes, I saw your point 2) in the previous emails :) But there was no patch for it in the previous reply, so I was thinking that maybe there were too many things to do and you forgot about it :) Thanks Paweł
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-08-15 at 11:11, Paweł Staszewski wrote: diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644 --- a/net/8021q/vlan_netlink.c +++ b/net/8021q/vlan_netlink.c @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev, vlan->vlan_proto = proto; vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]); vlan->real_dev = real_dev; + dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE); vlan->flags = VLAN_FLAG_REORDER_HDR; err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id); Any plans for this patch to go into the kernel properly? So far I have been using it for about 3 weeks on all my Linux based routers - and no problems.
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 13:31, Paweł Staszewski pisze: W dniu 2017-09-21 o 13:03, Eric Dumazet pisze: OK we have two problems here 1) We need to unify skb_dst_force() ( for net tree ) 2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from lower device. This will considerably help your performance. For 1), this is what I had in mind, can you try it ? Thanks a lot ! diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } @@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb __skb_dst_copy(nskb, oskb->_skb_refdst); } -/** - * skb_dst_force - makes sure skb dst is refcounted - * @skb: buffer - * - * If dst is not yet refcounted, let's do it - */ -static inline void skb_dst_force(struct sk_buff *skb) -{ - if (skb_dst_is_noref(skb)) { - WARN_ON(!rcu_read_lock_held()); - skb->_skb_refdst &= ~SKB_DST_NOREF; - dst_clone(skb_dst(skb)); - } -} - /** * dst_hold_safe - Take a reference on a dst if possible * @dst: pointer to dst entry @@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb) } } +/** + * skb_dst_force - makes sure skb dst is refcounted + * @skb: buffer + * + * If dst is not yet refcounted, let's do it + */ +static inline void skb_dst_force(struct sk_buff *skb) +{ + if (skb_dst_is_noref(skb)) { + struct dst_entry *dst = skb_dst(skb); + + WARN_ON(!rcu_read_lock_held()); + if (!dst_hold_safe(dst)) + dst = NULL; + skb->_skb_refdst = (unsigned long)dst; + } +} /** * __skb_tunnel_rx - prepare skb for rx reinsert Patch applied - soo far no problems - and no warnings in dmesg ok after adding patch all is working from now for about 1 hour of normal traffic witc all bgp sessions connected and about 600k prefixes in kernel.
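For context, the unified skb_dst_force() above relies entirely on dst_hold_safe(). The snippet below approximates what that helper looked like in include/net/dst.h after the DST_NOCACHE removal; the _sketch suffix is added here and this is an illustration, not a verbatim copy. With the old dst garbage collector gone, a dst whose refcount has already reached zero is about to be freed, so the only safe way to take a reference is one that refuses the 0 -> 1 transition.

#include <net/dst.h>

/* Approximation of dst_hold_safe(): take a reference only if the refcount
 * is still non-zero, i.e. never resurrect a dying dst.
 */
static inline bool dst_hold_safe_sketch(struct dst_entry *dst)
{
	return atomic_inc_not_zero(&dst->__refcnt);
}

That is also why the rewritten skb_dst_force() clears the skb's dst when the hold fails: the packet then simply carries no dst instead of pointing at memory that is being torn down.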
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 13:03, Eric Dumazet pisze: OK we have two problems here 1) We need to unify skb_dst_force() ( for net tree ) 2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from lower device. This will considerably help your performance. For 1), this is what I had in mind, can you try it ? Thanks a lot ! diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } @@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb __skb_dst_copy(nskb, oskb->_skb_refdst); } -/** - * skb_dst_force - makes sure skb dst is refcounted - * @skb: buffer - * - * If dst is not yet refcounted, let's do it - */ -static inline void skb_dst_force(struct sk_buff *skb) -{ - if (skb_dst_is_noref(skb)) { - WARN_ON(!rcu_read_lock_held()); - skb->_skb_refdst &= ~SKB_DST_NOREF; - dst_clone(skb_dst(skb)); - } -} - /** * dst_hold_safe - Take a reference on a dst if possible * @dst: pointer to dst entry @@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb) } } +/** + * skb_dst_force - makes sure skb dst is refcounted + * @skb: buffer + * + * If dst is not yet refcounted, let's do it + */ +static inline void skb_dst_force(struct sk_buff *skb) +{ + if (skb_dst_is_noref(skb)) { + struct dst_entry *dst = skb_dst(skb); + + WARN_ON(!rcu_read_lock_held()); + if (!dst_hold_safe(dst)) + dst = NULL; + skb->_skb_refdst = (unsigned long)dst; + } +} /** *__skb_tunnel_rx - prepare skb for rx reinsert Patch applied - soo far no problems - and no warnings in dmesg
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 13:12, Paweł Staszewski pisze: W dniu 2017-09-21 o 13:03, Eric Dumazet pisze: On Thu, 2017-09-21 at 11:06 +0200, Paweł Staszewski wrote: W dniu 2017-09-21 o 03:17, Eric Dumazet pisze: On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote: Thanks very much Pawel for the feedback. I was looking into the code (specifically IPv4 part) and found that in free_fib_info_rcu(), we call free_nh_exceptions() without holding the fnhe_lock. I am wondering if that could cause some race condition on fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the same dst could be happening. But as we call free_fib_info_rcu() only after the grace period, and the lookup code which could potentially modify fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems fine... Hi Pawel, Could you try the following debug patch on top of net-next branch and reproduce the issue check if there are warning msg showing? diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a352..82aff41c6f63 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } Thanks. Wei Yes, we believe skb_dst_force() and skb_dst_force_safe() should be unified (to the 'safe' version) We no longer have gc to protect from 0 -> 1 transition of dst refcount. After adding patch from Wei https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14 OK we have two problems here 1) We need to unify skb_dst_force() ( for net tree ) 2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from lower device. This will considerably help your performance. For 1), this is what I had in mind, can you try it ? Thanks a lot ! 
diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } @@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb __skb_dst_copy(nskb, oskb->_skb_refdst); } -/** - * skb_dst_force - makes sure skb dst is refcounted - * @skb: buffer - * - * If dst is not yet refcounted, let's do it - */ -static inline void skb_dst_force(struct sk_buff *skb) -{ - if (skb_dst_is_noref(skb)) { - WARN_ON(!rcu_read_lock_held()); - skb->_skb_refdst &= ~SKB_DST_NOREF; - dst_clone(skb_dst(skb)); - } -} - /** * dst_hold_safe - Take a reference on a dst if possible * @dst: pointer to dst entry @@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb) } } +/** + * skb_dst_force - makes sure skb dst is refcounted + * @skb: buffer + * + * If dst is not yet refcounted, let's do it + */ +static inline void skb_dst_force(struct sk_buff *skb) +{ + if (skb_dst_is_noref(skb)) { + struct dst_entry *dst = skb_dst(skb); + + WARN_ON(!rcu_read_lock_held()); + if (!dst_hold_safe(dst)) + dst = NULL; + skb->_skb_refdst = (unsigned long)dst; + } +} /** * __skb_tunnel_rx - prepare skb for rx reinsert Thanks What is weird i have this part in my net-next from git: /** * skb_dst_force_safe - makes sure skb dst is refcounted * @skb: buffer * * If dst is not yet refcounted and not destroyed, grab a ref on it. */ static inline void skb_dst_force_safe(struct sk_buff *skb) { if (skb_dst_is_noref(skb)) { struct dst_entry *dst = skb_dst(skb); if (!dst_hold_safe(dst)) dst = NULL; skb->_skb_refdst = (unsigned long)dst; } } ok the difference is skb_dst_force_safe not skb_dst_force
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 13:03, Eric Dumazet pisze: On Thu, 2017-09-21 at 11:06 +0200, Paweł Staszewski wrote: W dniu 2017-09-21 o 03:17, Eric Dumazet pisze: On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote: Thanks very much Pawel for the feedback. I was looking into the code (specifically IPv4 part) and found that in free_fib_info_rcu(), we call free_nh_exceptions() without holding the fnhe_lock. I am wondering if that could cause some race condition on fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the same dst could be happening. But as we call free_fib_info_rcu() only after the grace period, and the lookup code which could potentially modify fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems fine... Hi Pawel, Could you try the following debug patch on top of net-next branch and reproduce the issue check if there are warning msg showing? diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a352..82aff41c6f63 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } Thanks. Wei Yes, we believe skb_dst_force() and skb_dst_force_safe() should be unified (to the 'safe' version) We no longer have gc to protect from 0 -> 1 transition of dst refcount. After adding patch from Wei https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14 OK we have two problems here 1) We need to unify skb_dst_force() ( for net tree ) 2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from lower device. This will considerably help your performance. For 1), this is what I had in mind, can you try it ? Thanks a lot ! diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } @@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb __skb_dst_copy(nskb, oskb->_skb_refdst); } -/** - * skb_dst_force - makes sure skb dst is refcounted - * @skb: buffer - * - * If dst is not yet refcounted, let's do it - */ -static inline void skb_dst_force(struct sk_buff *skb) -{ - if (skb_dst_is_noref(skb)) { - WARN_ON(!rcu_read_lock_held()); - skb->_skb_refdst &= ~SKB_DST_NOREF; - dst_clone(skb_dst(skb)); - } -} - /** * dst_hold_safe - Take a reference on a dst if possible * @dst: pointer to dst entry @@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb) } } +/** + * skb_dst_force - makes sure skb dst is refcounted + * @skb: buffer + * + * If dst is not yet refcounted, let's do it + */ +static inline void skb_dst_force(struct sk_buff *skb) +{ + if (skb_dst_is_noref(skb)) { + struct dst_entry *dst = skb_dst(skb); + + WARN_ON(!rcu_read_lock_held()); + if (!dst_hold_safe(dst)) + dst = NULL; + skb->_skb_refdst = (unsigned long)dst; + } +} /** *__skb_tunnel_rx - prepare skb for rx reinsert Thanks What is weird i have this part in my net-next from git: /** * skb_dst_force_safe - makes sure skb dst is refcounted * @skb: buffer * * If dst is not yet refcounted and not destroyed, grab a ref on it. 
 */
static inline void skb_dst_force_safe(struct sk_buff *skb)
{
	if (skb_dst_is_noref(skb)) {
		struct dst_entry *dst = skb_dst(skb);

		if (!dst_hold_safe(dst))
			dst = NULL;

		skb->_skb_refdst = (unsigned long)dst;
	}
}
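As background for why both helpers test skb_dst_is_noref() first: skb->_skb_refdst encodes the dst pointer together with a low "noref" bit. The sketch below paraphrases the accessors from the kernel headers of that era; the _SKETCH/_sketch names are added here and the snippet is illustrative, not verbatim. A noref dst is only valid under rcu_read_lock(), which is why it must be converted into a real reference (skb_dst_force / skb_dst_force_safe) before the skb is parked somewhere that outlives the RCU section, such as a qdisc queue or a socket backlog.

#include <linux/skbuff.h>
#include <net/dst.h>

#define SKB_DST_NOREF_SKETCH	1UL
#define SKB_DST_PTRMASK_SKETCH	(~(SKB_DST_NOREF_SKETCH))

/* Mask off the low bit to recover the dst pointer. */
static inline struct dst_entry *skb_dst_sketch(const struct sk_buff *skb)
{
	return (struct dst_entry *)(skb->_skb_refdst & SKB_DST_PTRMASK_SKETCH);
}

/* Low bit set => the skb only borrows the dst under RCU, no refcount held. */
static inline bool skb_dst_is_noref_sketch(const struct sk_buff *skb)
{
	return (skb->_skb_refdst & SKB_DST_NOREF_SKETCH) && skb_dst_sketch(skb);
}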
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 03:17, Eric Dumazet pisze: On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote: Thanks very much Pawel for the feedback. I was looking into the code (specifically IPv4 part) and found that in free_fib_info_rcu(), we call free_nh_exceptions() without holding the fnhe_lock. I am wondering if that could cause some race condition on fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the same dst could be happening. But as we call free_fib_info_rcu() only after the grace period, and the lookup code which could potentially modify fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems fine... Hi Pawel, Could you try the following debug patch on top of net-next branch and reproduce the issue check if there are warning msg showing? diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a352..82aff41c6f63 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } Thanks. Wei Yes, we believe skb_dst_force() and skb_dst_force_safe() should be unified (to the 'safe' version) We no longer have gc to protect from 0 -> 1 transition of dst refcount. After adding patch from Wei https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14
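A note on why Wei's one-line debug change (atomic_inc -> dst_hold in dst_clone()) is useful for this hunt: around this kernel version dst_hold() carried a debug check roughly like the sketch below (an approximation with a _sketch suffix, not the exact source). With the gc removed, any code path that bumps a refcount which has already dropped to zero performs exactly the 0 -> 1 resurrection Eric mentions, and the WARN turns that otherwise silent use-after-free into a visible stack trace pointing at the offender.

#include <net/dst.h>

/* Approximation of dst_hold() with the debug check of that period. */
static inline void dst_hold_sketch(struct dst_entry *dst)
{
	/* atomic_inc_not_zero() returns 0 when the refcount was already 0,
	 * i.e. someone still holds a pointer to a dst that is being freed.
	 */
	WARN_ON(atomic_inc_not_zero(&dst->__refcnt) == 0);
}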
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 23:25, Paweł Staszewski pisze: W dniu 2017-09-20 o 23:24, Paweł Staszewski pisze: W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops bisected again and same result: b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit commit b838d5e1c5b6e57b10ec8af2268824041e3ea911 Author: Wei Wang Date: Sat Jun 17 10:42:32 2017 -0700 ipv4: mark DST_NOGC and remove the operation of dst_free() With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed
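To make the bisect result easier to interpret: the commit identified as first bad moves IPv4 dsts to a purely refcount-driven lifetime. The sketch below paraphrases the release side of that model (names such as dst_destroy_rcu come from net/core/dst.c, but the function here is an illustration, not the exact code). With no garbage collector standing between "refcount hit zero" and "memory freed", a single missing dst_hold() anywhere in the forwarding path becomes a use-after-free or a refcount underflow instead of a harmless leak, which is consistent with the problem only reproducing once a large number of BGP prefixes is loaded.

#include <linux/rcupdate.h>
#include <linux/printk.h>
#include <net/dst.h>

/* Paraphrase of the refcount-only release model introduced by the series. */
void dst_release_sketch(struct dst_entry *dst)
{
	if (dst) {
		int newrefcnt = atomic_dec_return(&dst->__refcnt);

		if (unlikely(newrefcnt < 0))
			pr_warn("dst_release: dst %p refcnt %d\n", dst, newrefcnt);
		if (!newrefcnt)
			call_rcu(&dst->rcu_head, dst_destroy_rcu);	/* freed after an RCU grace period */
	}
}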
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 23:24, Paweł Staszewski pisze: W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops bisected again and same result: b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit commit b838d5e1c5b6e57b10ec8af2268824041e3ea911 Author: Wei Wang Date: Sat Jun 17 10:42:32 2017 -0700 ipv4: mark DST_NOGC and remove the operation of dst_free() With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed-off-by: Wei Wang Acked-by: Mart
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops bisected again and same result: b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit commit b838d5e1c5b6e57b10ec8af2268824041e3ea911 Author: Wei Wang Date: Sat Jun 17 10:42:32 2017 -0700 ipv4: mark DST_NOGC and remove the operation of dst_free() With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed-off-by: Wei Wang Acked-by: Martin KaFai Lau Signed-off-by: David S.
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops bisected again and same result: b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit commit b838d5e1c5b6e57b10ec8af2268824041e3ea911 Author: Wei Wang Date: Sat Jun 17 10:42:32 2017 -0700 ipv4: mark DST_NOGC and remove the operation of dst_free() With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed-off-by: Wei Wang Acked-by: Martin KaFai Lau Signed-off-by: David S. Miller :04 04 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 831a73b71d3df1755f3e
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops