I apologize for the over-10K email... consider this documentation ;->
This is cross-posted to l-k; i would prefer the discussions to happen on netdev,
or at least cc netdev (i am not subscribed to l-k).

This is a port against 2.4.0-test8, based on the OLS presentation i made,
"Fast Forwarding the Bird", available at:
http://robur.slu.se/Linux/net-development/jamal/FF-html/

There have been lotsa improvements since OLS, in a collaboration involving 
Robert Olsson and myself. 

Current snapshot is available via ftp from:
ftp://robur.slu.se:/pub/Linux/net-development/FF-bird-current.tgz
This includes a patched tulip driver; tested only on DEC21143-based
4-port cards.

This RFC is also available at:
ftp://robur.slu.se:/pub/Linux/net-development/README-FF

To make things interesting, robur.slu.se is currently running these patches.

Acknowledgements
-----------------

1) Alexey Kuznetsov: Without his FF and FC, these thoughts might never have
been born. Alexey is still involved whenever time permits.
2) Robert Olsson: Many insights; current partner-in-crime.
3) Donald Becker: Well-thought-out network driver design.

-----------------------------
Test updates September/2000:
-----------------------------

Robert Olsson and I decided after the OLS that we were going to try to
hit the 100Mbps (148.8Kpps) routing peak by year end. I am afraid the
bar has been raised. Robert is already hitting ~148Kpps with 2.4.0-test7
on an ASUS CUBX motherboard carrying a PIII 700MHz Coppermine, at
about 65% CPU utilization.
With a single PII-based Dell machine i was able to get a consistent
110Kpps.

So the new goal is to go to about 500Kpps ;-> (maybe not by year end, but
surely by the next random Linux hacker conference)

A sample modified tulip driver (hacked by Alexey for 2.2 and mod'ed by Robert
and myself over a period of time) is supplied as an example of how to use the
feedback values.


General blurb
-------------

To begin, i have to say that forwarding 100Mbps of 64-byte packets is _not_
a problem at all in Linux; Alexey's Fast Forwarding does a fine job.

FF, like fast-path forwarding in routers (e.g. Cisco fast routing), is _not_
subjected to firewalling. So the challenge was to try and do it without FF
turned on, so that we could do firewalling.

** Very important to note:
Although the tests and improvements were for packet forwarding, the technique
used applies to servers under heavy load as well. System congestion is pushed
down to the hardware, thereby alleviating the system overload.
I believe we could have done better in the Mindcraft tests with these 
changes in 2.2 (and HW FC turned on). 
[Fact is, HW FC was there, but i suppose no-one knew you could
use Alexey's hacked version of the tulip ;->]

The changes proposed below are transparent to drivers that don't use them. 
It is, however, highly encouraged that they take advantage of the supplied 
interface. This does not break anything in 2.4. It is a clean patch.

- This is a first cut; hopefully discussions will ensue and maybe a 
revision of the patch. Of particular interest are the recent weird
requirements of the X.25 code.
Refer to the thread on l-k, "Re: Q: sock output serialization".
Henner, hoping to hear from you.

- I intend to submit this patch for inclusion in 2.4, since it is 
non-intrusive. I suspect there will be about one more revision.

*** Proposed changes:

Change 1) 
---------

netif_rx() now returns a value to the driver (changed from void).

The backlog queue is divided by 4 configurable thresholds into 5 regions:
* a no-congestion zone
* a low congestion zone
* a moderate congestion zone
* a high congestion zone
* a drop zone, where packets are dropped
(the configuration interface is via /proc/sys/net/core/)

A positive return value (different for each region) implies that the packet 
was successfully queued, whereas a negative value implies it was dropped.
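
As an illustration (this is not from the patch; my_dev_rx() and
my_throttle_rx() are made-up names), a driver receive path could react
to the return value roughly like this:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* my_throttle_rx() stands in for whatever knob the driver has for
 * slowing its rx rate (interrupt mitigation, masking the rx interrupt,
 * killing the rx thread, ...) -- purely hypothetical */
static void my_throttle_rx(struct net_device *dev, int hard)
{
	/* driver-specific: adjust mitigation, mask rx irq, etc. */
}

static void my_dev_rx(struct net_device *dev, struct sk_buff *skb)
{
	int cng = netif_rx(skb);	/* now returns a BLG_CNG_* level */

	if (cng == BLG_CNG_DROP)	/* negative: the stack dropped it */
		my_throttle_rx(dev, 1);	/* back off hard */
	else if (cng >= BLG_CNG_MOD)	/* queued, but congestion building */
		my_throttle_rx(dev, 0);	/* ease off */
	/* BLG_CNG_NONE / BLG_CNG_LOW: keep 'em coming */
}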

The default congestion threshold values are based on engineering 
experimentation and not on theoretical scientific proofs. There are 
probably better ways of drawing up the associated thresholds.

I would like to add that the HW FC feature is _a necessary requirement_
to complement this. Maybe i should say this complements HW FC.
If a driver keeps sending packets up to netif_rx() even when it's been given 
feedback to stop, HW FC kicks in and the device is shut up; "sent to its room" 
so to speak. So HW FC is considered the "when all else fails" rule.
A separate document will describe how driver authors should use HW FC.
I think it should be a *mandatory* interface for drivers.

The netif_rx() feedback gives the driver two insights:
i) it learns the fate of the packets it sent up the stack. There is no point
in continuing to blast packets at the stack when they are being dropped
(which is what happens today).
ii) it can smartly gauge the congestion level of the stack and adjust the
rate at which it sends packets up, to reduce overall
system load. A scheme which moves the congestion away from the system and
onto the driver is considered a bonus. The tulip does this in two ways,
selectable at compile time: 1) kill the rx thread, or 2) kill the
rx_interrupt. Both of these work extremely well.
[The sample tulip driver can be made to use 2) by undefining D_INT in
tulip.h]

The driver uses the feedback information to intelligently adjust its 
sending rate (i.e. reduce or increase calls to netif_rx(), or send a 
congestion-experienced frame to its peer, e.g. in X.25). 
In the sample tulip driver, dynamic mitigation based on congestion feedback 
is used (see the sketch below). Under low congestion, the mitigation
parameters are turned off (the default behavior, as today); under heavy
congestion we dynamically move up to 16 packet times. It is not mandatory to
use this scheme; however, it serves as a good example. A word of caution:
this scheme is still being experimented on; we feel we could do better.
Look at the code.
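
To sketch the mitigation idea (the real logic is in the sample tulip
driver; struct my_priv and my_write_mitigation_reg() below are invented
for illustration):

/* hypothetical per-device state; the sample tulip driver keeps the
 * real equivalent in its private struct */
struct my_priv {
	int rx_mitigation;	/* packet times to coalesce per rx interrupt */
};

static void my_write_mitigation_reg(struct my_priv *tp)
{
	/* hw-specific: program the rx interrupt mitigation register */
}

static void my_adjust_mitigation(struct my_priv *tp, int cng)
{
	if (cng == BLG_CNG_HIGH || cng == BLG_CNG_DROP)
		tp->rx_mitigation = 16;	/* heavy congestion: up to 16 packet times */
	else if (cng == BLG_CNG_MOD)
		tp->rx_mitigation = 4;	/* moderate: start coalescing */
	else
		tp->rx_mitigation = 0;	/* none/low: default, mitigation off */

	my_write_mitigation_reg(tp);
}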

change 2) 
---------

The backlog queue is now being sampled.
This helps the top layer detect incipient congestion. 
Essentially, the sampler is a low-pass filter which weeds out 
"congestion detected" wolf-cries due to sudden short bursts which
fill the backlog. 
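
Concretely, the filter in get_sample_stats() (see the dev.c patch below)
is an exponentially weighted moving average with weight 1/2, computed
cheaply with shifts:

	/* new average = half the old average plus half the instantaneous
	 * backlog length; short bursts get smoothed away */
	avg_blog = (avg_blog >> 1) + (blog >> 1);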

I have experimented with two schemes: one which samples the queue via 
a timer and one which does it per-packet, and found that the per-packet
sampler gave better results (more samples; Shannon's theorem applies).
It didn't matter whether HZ was 100 or 1024 during the tests.
The measure of "better" was throughput.

change 3) 
---------

Introduce a scheme which occasionally sends a "random lie"
under moderate to high congestion.
The motivation is to improve fairness among the many devices
sharing the same backlog queue.
E.g. if eth0 is blasting 70Kpps at the backlog queue and eth1 is merely 
sending 10pps, then under the current system setup eth0 will pretty
much fill up the backlog (every time it gets drained) and eth1 will get
very few opportunities to queue. Imagine scaling this to over 
10 interfaces with eth0 blasting at that rate.

The solution i have devised:
- Randomly lie in the feedback to the driver when under moderate to high
congestion levels (tell the device there is higher congestion than there
really is). The theory is that "the harder they come, the harder they fall":
it is more than likely that eth0's packets from the above example will be hit
by the randomness rather than eth1's, simply because there are more of them
to take target shots at (see the snippet below).
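
In the patch below (under #ifdef RAND_LIE in get_sample_stats()), the
lie is drawn per packet with probability proportional to the average
backlog; this is the high-congestion branch:

	rd = net_random();
	rq = rd % netdev_max_backlog;	/* uniform over [0, max_backlog) */
	if (rq < avg_blog)		/* P(lie) = avg_blog/netdev_max_backlog */
		softnet_data[cpu].cng_level = BLG_CNG_DROP;	/* one level worse */

Since a draw happens on every packet, the device contributing most of
the packets also draws most of the lies.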

My testing with the included scheme (#ifdef RAND_LIE) indicates that fairness 
in fact goes up; however, the overall throughput when only one interface
is utilizing the system goes down under moderate to heavy congestion.
I am including it here as a way to highlight the problem; i think there could
be better ways to do this.
The code is included and can be turned on by defining RAND_LIE in dev.c.

change 4) 
---------

After a brief talk with Alexey and Robert at the OLS, i am withdrawing this 
change proposal.
Currently, after netdev_dropping is raised, all packets arriving at the
backlog are dropped, even if there is only a single packet on the queue. 
I had proposed removing that check. In a world with good driver-zens 
(system-zens?), where every driver backs off once system congestion is
detected, the change would make sense. It is, however, unfair to ask the good
citizens to back off while the bad guys fill the backlog, which is what would
happen if the change were made.


cheers,
jamal
--- linux/net/core/dev.c.orig   Thu Sep  7 11:32:01 2000
+++ linux/net/core/dev.c        Tue Sep 12 20:02:20 2000
@@ -97,6 +97,9 @@
 extern int plip_init(void);
 #endif
 
+/*
+#define RAND_LIE 1
+*/
 NET_PROFILE_DEFINE(dev_queue_xmit)
 NET_PROFILE_DEFINE(softnet_process)
 
@@ -133,6 +136,11 @@
 static struct packet_type *ptype_base[16];             /* 16 way hashed list */
 static struct packet_type *ptype_all = NULL;           /* Taps */
 
+#ifdef offline_samp
+static void sample_queue(unsigned long dummy);
+static struct timer_list samp_timer = { { NULL, NULL }, 0, 0, &sample_queue };
+#endif
+
 /*
  *     Our notifier list
  */
@@ -951,12 +959,25 @@
   =======================================================================*/
 
 int netdev_max_backlog = 300;
+/* these numbers are selected based on intuition and some
+experimentation; if you have a more scientific way of doing this,
+please go ahead and fix things
+*/
+int no_cong_thresh=10;
+int no_cong=20;
+int lo_cong=100;
+int mod_cong=290;
+
+
 
 struct netif_rx_stats netdev_rx_stat[NR_CPUS];
 
 
 #ifdef CONFIG_NET_HW_FLOWCONTROL
+/*
 static atomic_t netdev_dropping = ATOMIC_INIT(0);
+*/
+atomic_t netdev_dropping = ATOMIC_INIT(0);
 static unsigned long netdev_fc_mask = 1;
 unsigned long netdev_fc_xoff = 0;
 spinlock_t netdev_fc_lock = SPIN_LOCK_UNLOCKED;
@@ -1014,6 +1035,56 @@
 }
 #endif
 
+static void get_sample_stats(int cpu)
+{
+#ifdef RAND_LIE
+       unsigned long rd;
+       int rq;
+#endif
+       int blog=softnet_data[cpu].input_pkt_queue.qlen;
+       int avg_blog=softnet_data[cpu].avg_blog;
+
+       avg_blog=(avg_blog >> 1)+ (blog>>1);
+
+       if (avg_blog > mod_cong) {
+/*  above moderate congestion levels */
+               softnet_data[cpu].cng_level= BLG_CNG_HIGH;
+#ifdef RAND_LIE
+               rd=net_random();
+               rq=rd% netdev_max_backlog;
+               if (rq < avg_blog) /* unlucky bastard */
+                       softnet_data[cpu].cng_level=BLG_CNG_DROP;
+#endif
+       } else if (avg_blog > lo_cong) {
+               softnet_data[cpu].cng_level= BLG_CNG_MOD;
+#ifdef RAND_LIE
+               rd=net_random();
+               rq=rd% netdev_max_backlog;
+               if (rq < avg_blog) /* unlucky bastard */
+                       softnet_data[cpu].cng_level=BLG_CNG_HIGH;
+#endif
+       } else if (avg_blog > no_cong) 
+               softnet_data[cpu].cng_level= BLG_CNG_LOW;
+       else  /* no congestion */
+               softnet_data[cpu].cng_level=BLG_CNG_NONE;
+
+       softnet_data[cpu].avg_blog=avg_blog;
+
+}
+
+#ifdef offline_samp
+static void sample_queue(unsigned long dummy)
+{
+/* 10ms or 1ms -- i don't care -- JHS */
+       int next_tick=1;
+       int cpu=smp_processor_id();
+       get_sample_stats(cpu);
+       next_tick+=jiffies;
+       mod_timer(&samp_timer, next_tick);
+}
+#endif
+
+
 /**
  *     netif_rx        -       post buffer to the network code
  *     @skb: buffer to post
@@ -1022,9 +1093,18 @@
  *     the upper (protocol) levels to process.  It always succeeds. The buffer
  *     may be dropped during processing for congestion control or by the 
  *     protocol layers.
+ *      
+ *     return values:
+ *     BLG_CNG_NONE    (no congestion)           
+ *     BLG_CNG_LOW     (low congestion) 
+ *     BLG_CNG_MOD     (moderate congestion)
+ *     BLG_CNG_HIGH    (high congestion) 
+ *     BLG_CNG_DROP    (packet was dropped)
+ *      
+ *      
  */
 
-void netif_rx(struct sk_buff *skb)
+int netif_rx(struct sk_buff *skb)
 {
        int this_cpu = smp_processor_id();
        struct softnet_data *queue;
@@ -1043,6 +1123,7 @@
        netdev_rx_stat[this_cpu].total++;
        if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
                if (queue->input_pkt_queue.qlen) {
+
                        if (queue->throttle)
                                goto drop;
 
@@ -1054,11 +1135,16 @@
                        __skb_queue_tail(&queue->input_pkt_queue,skb);
                        __cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
                        local_irq_restore(flags);
-                       return;
+#ifndef offline_samp
+                       get_sample_stats(this_cpu);
+#endif
+                       return softnet_data[this_cpu].cng_level;
+
                }
 
                if (queue->throttle) {
                        queue->throttle = 0;
+
 #ifdef CONFIG_NET_HW_FLOWCONTROL
                        if (atomic_dec_and_test(&netdev_dropping))
                                netdev_wakeup();
@@ -1080,6 +1166,8 @@
        local_irq_restore(flags);
 
        kfree_skb(skb);
+       return BLG_CNG_DROP;
+
 }
 
 /* Deliver skb to an old protocol, which is not threaded well
@@ -1292,12 +1380,28 @@
 
                if (bugdet-- < 0 || jiffies - start_time > 1)
                        goto softnet_break;
+
+/* 
+*/
+#ifdef CONFIG_NET_HW_FLOWCONTROL
+       if (queue->throttle && queue->input_pkt_queue.qlen < no_cong_thresh ) {
+               if (atomic_dec_and_test(&netdev_dropping))  {
+                       queue->throttle = 0;
+                       netdev_wakeup();
+                       goto softnet_break;
+               }
+
+       }
+#endif
+
+
        }
        br_read_unlock(BR_NETPROTO_LOCK);
 
        local_irq_disable();
        if (queue->throttle) {
                queue->throttle = 0;
+
 #ifdef CONFIG_NET_HW_FLOWCONTROL
                if (atomic_dec_and_test(&netdev_dropping))
                        netdev_wakeup();
@@ -2418,6 +2522,7 @@
 int __init net_dev_init(void)
 {
        struct net_device *dev, **dp;
+
        int i;
 
 #ifdef CONFIG_NET_SCHED
@@ -2434,6 +2539,8 @@
                queue = &softnet_data[i];
                skb_queue_head_init(&queue->input_pkt_queue);
                queue->throttle = 0;
+               queue->cng_level = 0;
+               queue->avg_blog = 10; /* arbitrary non-zero */
                queue->completion_queue = NULL;
        }
        
@@ -2442,6 +2549,12 @@
        NET_PROFILE_REGISTER(dev_queue_xmit);
        NET_PROFILE_REGISTER(softnet_process);
 #endif
+
+#ifdef offline_samp
+       samp_timer.expires=jiffies+(10 * HZ);
+       add_timer(&samp_timer);
+#endif
+
        /*
         *      Add the devices.
         *      If the call to dev->init fails, the dev is removed
@@ -2499,6 +2612,7 @@
 #ifdef CONFIG_PROC_FS
        proc_net_create("dev", 0, dev_get_info);
        create_proc_read_entry("net/softnet_stat", 0, 0, dev_proc_stats, NULL);
+
 #ifdef WIRELESS_EXT
        proc_net_create("wireless", 0, dev_get_wireless_info);
 #endif /* WIRELESS_EXT */
--- linux/include/linux/netdevice.h.orig        Fri Sep  8 15:53:53 2000
+++ linux/include/linux/netdevice.h     Tue Sep 12 20:01:25 2000
@@ -47,6 +47,13 @@
                                           (TC use only - dev_queue_xmit
                                           returns this as NET_XMIT_SUCCESS) */
 
+/* Backlog congestion levels */
+#define BLG_CNG_NONE           1   /* keep 'em coming, baby */
+#define BLG_CNG_LOW            2   /* storm alert, just in case */
+#define BLG_CNG_MOD            3   /* Storm on its way! */
+#define BLG_CNG_HIGH           5   /* The Tsunami is here, boys and girls */
+#define BLG_CNG_DROP           -1  /* too late, mon ami! packet dropped */
+
 #define net_xmit_errno(e)      ((e) != NET_XMIT_CN ? -ENOBUFS : 0)
 
 #endif
@@ -440,6 +447,8 @@
 struct softnet_data
 {
        int                     throttle;
+       int                     cng_level;
+       int                     avg_blog;
        struct sk_buff_head     input_pkt_queue;
        struct net_device       *output_queue;
        struct sk_buff          *completion_queue;
@@ -526,7 +535,7 @@
 
 extern void            net_call_rx_atomic(void (*fn)(void));
 #define HAVE_NETIF_RX 1
-extern void            netif_rx(struct sk_buff *skb);
+extern int             netif_rx(struct sk_buff *skb);
 extern int             dev_ioctl(unsigned int cmd, void *);
 extern int             dev_change_flags(struct net_device *, unsigned);
 extern void            dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev);
@@ -628,6 +637,7 @@
 extern void            netdev_unregister_fc(int bit);
 extern int             netdev_max_backlog;
 extern unsigned long   netdev_fc_xoff;
+extern atomic_t netdev_dropping;
 extern int             netdev_set_master(struct net_device *dev, struct net_device *master);
 #ifdef CONFIG_NET_FASTROUTE
 extern int             netdev_fastroute;
--- /usr/src/linux/net/netsyms.c        2000/09/16 23:00:33     1.1
+++ /usr/src/linux/net/netsyms.c        2000/09/16 23:01:28
@@ -490,6 +490,7 @@
 EXPORT_SYMBOL(dev_ioctl);
 EXPORT_SYMBOL(dev_queue_xmit);
 #ifdef CONFIG_NET_HW_FLOWCONTROL
+EXPORT_SYMBOL(netdev_dropping);
 EXPORT_SYMBOL(netdev_register_fc);
 EXPORT_SYMBOL(netdev_unregister_fc);
 EXPORT_SYMBOL(netdev_fc_xoff);
