Hi

Wondering if someone has an idea how to optimise this?

From a real-life perspective it is really important to optimise this behavior, because imagine the following situation — just a normal situation with Linux acting as a router:

Let's say we have a Linux router with 3 connected customers and one upstream - and some DDoS situation occurs

1. DDoS traffic is coming from upstream (and let's say we are handling this DDoS on our forwarding router with 50% CPU load) to our router, and the router is forwarding this traffic to some customer

(remember we have 3 customers - one is getting DDoSed now, directed at one of his IPs)

2. Customer X is getting DDoSed - his router is at 100% CPU load (because of low-end hardware, or 100% load on the up-link).

3. Customer X's router is not responding, or some watchdog restarted it - or customer X disabled the router for security reasons

4. After the fdb entry expires on our service router - the two other customers start to have problems, because our router goes from 50% to 100% on all cores - and everybody now experiences packet drops and bandwidth drops


And this doesn't need to be DDoS - it can be many servers connected by a Linux router, where one server pushes some UDP stream (IPTV or another filesystem-syncing protocol) to another via the forwarding Linux router - if the receiving server goes down and disappears from ARP, all other streams that are forwarded by the Linux router will suffer from this.



W dniu 2017-08-16 o 12:07, Paweł Staszewski pisze:
Hi


Patch applied - but no big change - from 0.7 Mpps per vlan to 1.2 Mpps per vlan

previously(without patch) 100% cpu load:

  bwm-ng v0.6.1 (probing every 0.500s), press 'h' for help
  input: /proc/net/dev type: rate
  |         iface                   Rx Tx                Total
============================================================================== vlan1002: 0.00 P/s 1.99 P/s 1.99 P/s
         vlan1001:            0.00 P/s        717227.12 P/s 717227.12 P/s
enp175s0f0: 2713679.25 P/s 0.00 P/s 2713679.25 P/s
         vlan1000:            0.00 P/s        716145.44 P/s 716145.44 P/s
------------------------------------------------------------------------------ total: 2713679.25 P/s 1433374.50 P/s 4147054.00 P/s


With the patch (still 100% CPU load, but a little better pps performance)

 bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
  input: /proc/net/dev type: rate
  |         iface                   Rx Tx                Total
============================================================================== vlan1002: 0.00 P/s 1.00 P/s 1.00 P/s vlan1001: 0.00 P/s 1202161.50 P/s 1202161.50 P/s enp175s0f0: 3699864.50 P/s 0.00 P/s 3699864.50 P/s vlan1000: 0.00 P/s 1196870.38 P/s 1196870.38 P/s ------------------------------------------------------------------------------ total: 3699864.50 P/s 2399033.00 P/s 6098897.50 P/s


perf top attached below:

     1.90%     0.00%  ksoftirqd/39    [kernel.vmlinux] [k] run_ksoftirqd
            |
             --1.90%--run_ksoftirqd
                       |
                        --1.90%--__softirqentry_text_start
                                  |
                                   --1.90%--net_rx_action
                                             |
--1.90%--mlx5e_napi_poll
                                                        |
--1.89%--mlx5e_poll_rx_cq
|
--1.88%--mlx5e_handle_rx_cqe
|
--1.85%--napi_gro_receive
|
--1.85%--netif_receive_skb_internal
|
--1.85%--__netif_receive_skb
|
--1.85%--__netif_receive_skb_core
|
--1.85%--ip_rcv
|
--1.85%--ip_rcv_finish
|
--1.83%--ip_forward
|
--1.82%--ip_forward_finish
|
--1.82%--ip_output
|
--1.82%--ip_finish_output
|
--1.82%--ip_finish_output2
|
--1.79%--neigh_resolve_output
|
--1.77%--neigh_event_send
|
--1.77%--__neigh_event_send
|
--1.74%--_raw_write_lock_bh
|
--1.74%--queued_write_lock
queued_write_lock_slowpath
|
--1.70%--queued_spin_lock_slowpath


1.90% 0.00% ksoftirqd/34 [kernel.vmlinux] [k] __softirqentry_text_start
            |
            ---__softirqentry_text_start
               |
                --1.90%--net_rx_action
                          |
                           --1.90%--mlx5e_napi_poll
                                     |
                                      --1.89%--mlx5e_poll_rx_cq
                                                |
--1.88%--mlx5e_handle_rx_cqe
                                                           |
--1.86%--napi_gro_receive
                                                                      |
--1.85%--netif_receive_skb_internal
|
--1.85%--__netif_receive_skb
|
--1.85%--__netif_receive_skb_core
|
--1.85%--ip_rcv
|
--1.85%--ip_rcv_finish
|
--1.83%--ip_forward
|
--1.82%--ip_forward_finish
|
--1.82%--ip_output
|
--1.82%--ip_finish_output
|
--1.82%--ip_finish_output2
|
--1.79%--neigh_resolve_output
|
--1.77%--neigh_event_send
|
--1.77%--__neigh_event_send
|
--1.74%--_raw_write_lock_bh
queued_write_lock
queued_write_lock_slowpath
|
--1.71%--queued_spin_lock_slowpath

1.85% 0.00% ksoftirqd/38 [kernel.vmlinux] [k] ip_rcv_finish
            |
             --1.85%--ip_rcv_finish
                       |
                        --1.83%--ip_forward
                                  |
                                   --1.82%--ip_forward_finish
                                             |
                                              --1.82%--ip_output
                                                        |
--1.82%--ip_finish_output
|
--1.82%--ip_finish_output2
|
--1.79%--neigh_resolve_output
|
--1.77%--neigh_event_send
|
--1.77%--__neigh_event_send
|
--1.74%--_raw_write_lock_bh
queued_write_lock
queued_write_lock_slowpath
|
--1.71%--queued_spin_lock_slowpath

     1.85%     0.00%  ksoftirqd/22    [kernel.vmlinux] [k] ip_rcv
            |
             --1.85%--ip_rcv
                       |
                        --1.85%--ip_rcv_finish
                                  |
                                   --1.83%--ip_forward
                                             |
--1.82%--ip_forward_finish
                                                        |
--1.82%--ip_output
|
--1.82%--ip_finish_output
|
--1.82%--ip_finish_output2
|
--1.79%--neigh_resolve_output
|
--1.77%--neigh_event_send
|
--1.77%--__neigh_event_send
|
--1.73%--_raw_write_lock_bh
queued_write_lock
queued_write_lock_slowpath
|
--1.70%--queued_spin_lock_slowpath

1.83% 0.00% ksoftirqd/9 [kernel.vmlinux] [k] ip_forward
            |
             --1.83%--ip_forward
                       |
                        --1.82%--ip_forward_finish
                                  |
                                   --1.82%--ip_output
                                             |
--1.82%--ip_finish_output
                                                        |
--1.82%--ip_finish_output2
|
--1.79%--neigh_resolve_output
|
--1.77%--neigh_event_send
|
--1.77%--__neigh_event_send
|
--1.74%--_raw_write_lock_bh
queued_write_lock
queued_write_lock_slowpath
|
--1.70%--queued_spin_lock_slowpath


     1.82%     0.00%  ksoftirqd/35    [kernel.vmlinux] [k] ip_output
            |
             --1.82%--ip_output
                       |
                        --1.82%--ip_finish_output
                                  |
                                   --1.82%--ip_finish_output2
                                             |
--1.79%--neigh_resolve_output
                                                        |
--1.77%--neigh_event_send
|
--1.77%--__neigh_event_send
|
--1.74%--_raw_write_lock_bh
queued_write_lock
queued_write_lock_slowpath
|
--1.71%--queued_spin_lock_slowpath

1.82% 0.00% ksoftirqd/38 [kernel.vmlinux] [k] ip_finish_output
            |
             --1.82%--ip_finish_output
                       |
                        --1.82%--ip_finish_output2
                                  |
                                   --1.79%--neigh_resolve_output
                                             |
--1.77%--neigh_event_send
                                                        |
--1.77%--__neigh_event_send
|
--1.74%--_raw_write_lock_bh
queued_write_lock
queued_write_lock_slowpath
|
--1.71%--queued_spin_lock_slowpath

1.82% 0.00% ksoftirqd/37 [kernel.vmlinux] [k] ip_forward_finish
            |
             --1.82%--ip_forward_finish
                       ip_output
                       |
                        --1.82%--ip_finish_output
                                  |
                                   --1.82%--ip_finish_output2
                                             |
--1.79%--neigh_resolve_output
                                                        |
--1.76%--neigh_event_send
__neigh_event_send
|
--1.73%--_raw_write_lock_bh
queued_write_lock
queued_write_lock_slowpath
|
--1.70%--queued_spin_lock_slowpath


W dniu 2017-08-16 o 09:42, Julian Anastasov pisze:
    Hello,

On Tue, 15 Aug 2017, Eric Dumazet wrote:

It must be possible to add a fast path without locks.

(say if jiffies has not changed before last state change)
    New day - new idea. Something like this? But it
has a bug: without checking neigh->dead under the lock we don't
have the right to access neigh->parms; it can be destroyed
immediately by neigh_release->neigh_destroy->neigh_parms_put->
neigh_parms_destroy->kfree. Not sure, maybe kfree_rcu can help
with this...

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 9816df2..f52763c 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -428,10 +428,10 @@ static inline int neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
  {
      unsigned long now = jiffies;

-    if (neigh->used != now)
-        neigh->used = now;
      if (!(neigh->nud_state&(NUD_CONNECTED|NUD_DELAY|NUD_PROBE)))
          return __neigh_event_send(neigh, skb);
+    if (neigh->used != now)
+        neigh->used = now;
      return 0;
  }
  diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 16a1a4c..52a8718 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -991,8 +991,18 @@ static void neigh_timer_handler(unsigned long arg)
    int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
  {
-    int rc;
      bool immediate_probe = false;
+    unsigned long now = jiffies;
+    int rc;
+
+    if (neigh->used != now) {
+        neigh->used = now;
+    } else if (neigh->nud_state == NUD_INCOMPLETE &&
+           (!skb || neigh->arp_queue_len_bytes + skb->truesize >
+            NEIGH_VAR(neigh->parms, QUEUE_LEN_BYTES))) {
+        kfree_skb(skb);
+        return 1;
+    }
        write_lock_bh(&neigh->lock);
@@ -1005,7 +1015,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
      if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
          if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
              NEIGH_VAR(neigh->parms, APP_PROBES)) {
-            unsigned long next, now = jiffies;
+            unsigned long next;
                atomic_set(&neigh->probes,
                     NEIGH_VAR(neigh->parms, UCAST_PROBES));

Regards

--
Julian Anastasov <j...@ssi.bg>




Reply via email to