[LARTC] Re: [PATCH] HTB O(1) class lookup

2007-02-05 Thread Simon Lodal
On Monday 05 February 2007 11:16, Jarek Poplawski wrote:
 On 01-02-2007 12:30, Andi Kleen wrote:
  Simon Lodal [EMAIL PROTECTED] writes:
  Memory is generally not an issue, but CPU is, and you can not beat the
  CPU efficiency of plain array lookup (always faster, and constant time).

 Probably for some old (or embedded) lean boxes used for
 small network routers, with memory hungry iptables -
 memory could be an issue.

Sure, but if they are that constrained they probably do not run HTB in the 
first place.

We are talking about 4k initially, up to 256k worst case (or 512k if your 
router is 64bit, unlikely if small is a priority).
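
For reference, a quick back-of-the-envelope check of those numbers. This is a
standalone userspace sketch, assuming 4k pages and the 16-bit minor classid
space used by the patch further down; the figures are computed, not measured:

#include <stdio.h>

#define PAGE_SIZE	4096
#define HTB_MAX_CLS	0x10000		/* TC_H_MIN(-1) + 1: 16-bit minor id space */

int main(void)
{
	int ptr_sizes[] = { 4, 8 };	/* 32-bit and 64-bit pointers */

	for (int i = 0; i < 2; i++) {
		int per_page = PAGE_SIZE / ptr_sizes[i];
		int pages = HTB_MAX_CLS / per_page;

		printf("%d-bit: %d classids per page, worst case %d pages = %d kB\n",
		       ptr_sizes[i] * 8, per_page, pages, pages * PAGE_SIZE / 1024);
	}
	return 0;
}

The minimum is a single 4k page, allocated lazily; the worst case is every
second-level page allocated (256k with 32-bit pointers, 512k with 64-bit ones).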

  And the worst memory consumption case considered by Patrick should
  be relatively unlikely.

 Anyway, such an approach - assuming that most users do something
 this (reasonable) way - doesn't look like good
 programming practice.

The current hash algorithm also assumes certain usage patterns, namely that 
you choose classids that generate different hash keys (= distribute uniformly 
across the buckets), or scalability will suffer very quickly. Even at 64 
classes you would probably see htb_find() near the top of a profiling 
analysis.
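
To illustrate, here is a standalone userspace sketch; htb_hash() is copied from
the current sch_htb.c (it also appears in the patch further down), and the two
classid sets are made-up examples. Script-generated sequential classids spread
evenly, while innocent-looking hand-picked ids such as 1:11, 1:22, ... 1:ff all
collapse into one bucket:

#include <stdio.h>
#include <string.h>

static int htb_hash(unsigned int h)
{
	h ^= h >> 8;
	h ^= h >> 4;
	return h & 0xf;
}

static void spread(const char *name, const unsigned int *ids, int n)
{
	int bucket[16];

	memset(bucket, 0, sizeof(bucket));
	for (int i = 0; i < n; i++)
		bucket[htb_hash(ids[i])]++;
	printf("%s:", name);
	for (int i = 0; i < 16; i++)
		printf(" %d", bucket[i]);
	printf("\n");
}

int main(void)
{
	unsigned int seq[64], dup[15];

	for (int i = 0; i < 64; i++)
		seq[i] = 0x100 + i;		/* script-generated 1:100 .. 1:13f */
	for (int i = 0; i < 15; i++)
		dup[i] = 0x11 * (i + 1);	/* hand-picked 1:11, 1:22, ... 1:ff */
	spread("sequential", seq, 64);		/* 4 classes per bucket */
	spread("repeated-digit", dup, 15);	/* all 15 in bucket 0 */
	return 0;
}

And even with a perfect spread, HTB_HSIZE is fixed at 16, so 14k classes still
means roughly 875 entries per bucket.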

But I would say that it is just as unlikely as choosing 64 classids that cause 
my patch to allocate all 256k.

In these unlikely cases, my patch only wastes passive memory, while the 
current htb wastes cpu to a point where it can severely limit routing 
performance.


 I wonder, why not try, at least for a while, to make this
 a compile (menuconfig) option with a comment:
 recommended for a large number of classes. After hash
 optimization and some testing, final decisions could be
 made.

I decided not to do it because it would mean too many ifdefs 
(ugly + unmaintainable code).


Regards
Simon


[LARTC] Re: [PATCH] HTB O(1) class lookup

2007-02-05 Thread Simon Lodal
On Thursday 01 February 2007 12:30, Andi Kleen wrote:
 Simon Lodal [EMAIL PROTECTED] writes:
  Memory is generally not an issue, but CPU is, and you can not beat the
  CPU efficiency of plain array lookup (always faster, and constant time).

 Actually that's not true when the array doesn't fit in cache.

 The cost of going out to memory past the caches is so large (a factor of 100
 and more) that algorithms with a smaller cache footprint can often easily beat
 algorithms that execute far fewer instructions, as long as they have fewer
 cache misses.
 That is because not all instructions have the same cost; anything
 in core is very fast but going out to memory is very slow.

 That said I think I agree with your analysis that a two level
 array is probably the right data structure for this and likely
 not less cache efficient than the hash table.

Good point.

The 2-level lookup generates 3 memory accesses (including getting at the 
htb_class struct). List traversal will generate many more memory accesses, 
unless the lists have 3 or fewer entries (currently that only holds true for 
up to 48 classes, uniformly distributed).
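
To make the three accesses explicit, here is a minimal standalone restatement of
the lookup; the constants and types mirror the patch below, but this is an
illustration, not the kernel code itself:

#include <stddef.h>

#define PAGE_SIZE		4096
#define HTB_CLS_PER_ARRAY	(PAGE_SIZE / sizeof(void *))
#define HTB_CLS_ARRAY(cid)	((cid) / HTB_CLS_PER_ARRAY)
#define HTB_CLS_INDEX(cid)	((cid) & (HTB_CLS_PER_ARRAY - 1))

struct htb_class { int placeholder; };	/* stands in for the real struct */
typedef struct htb_class *htb_cls_array[HTB_CLS_PER_ARRAY];

struct sched_view {
	htb_cls_array *classes[0x10000 / HTB_CLS_PER_ARRAY];
};

struct htb_class *find(struct sched_view *q, unsigned int minorid)
{
	/* access 1: top-level table of per-page arrays */
	htb_cls_array *a = q->classes[HTB_CLS_ARRAY(minorid)];

	if (a == NULL)
		return NULL;
	/* access 2: the per-page array itself; access 3 happens when the
	   caller dereferences the returned class */
	return (*a)[HTB_CLS_INDEX(minorid)];
}

So the pointer chase has a fixed depth of three, independent of how many classes
exist.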

It is difficult to judge if the tables will be in cache or not. The tables are 
clearly extra baggage for the cachelines, compared to only having the 
htb_class structs (they are going to be fetched anyway).

On the other hand, if you have 10k classes, they are usually (real world) 
allocated to individual users, of which at most half are active at any time. 
With hashing, all 10k classes are fetched into cachelines all the time, only 
in order to traverse the lists. That is roughly 160k of wasted cache (5000 x 32 
bytes); plenty for 10k pointers in lookup tables.


Regards
Simon


[LARTC] Re: [PATCH] HTB O(1) class lookup

2007-02-03 Thread Simon Lodal
On Thursday 01 February 2007 07:08, Patrick McHardy wrote:
 Simon Lodal wrote:
  This patch changes HTB's class storage from hash+lists to a two-level
  linear array, so it can do constant time (O(1)) class lookup by classid.
  It improves scalability for large number of classes.
 
  Without the patch, ~14k htb classes can starve a Xeon-3.2 at only 15kpps,
  using most of its cycles traversing lists in htb_find(). The patch
  eliminates this problem, and has a measurable impact even with a few
  hundred classes.
 
  Previously, scalability could be improved by increasing HTB_HSIZE, modifying
  the hash function, and recompiling, but this patch works for everyone
  without recompiling and scales better too.

 I agree that the current fixed sized hashes (additionally quite
 small by default) are a big problem with many classes, for all of
 HTB/HFSC/CBQ. But I think your approach is a bit wasteful: with
 unfortunately chosen classids, 128 classes are enough to reach the
 maximum memory usage of ~512kb (with 4k pages and 8 byte pointers).

I think it is a non-issue since it does not happen in practice.

Generally there are two ways to assign classids:

1) Manually, used when you have few classes. People usually use 100, 101, 102, 
200, 201 etc (probably unaware that they are hex). With 4k pages and 32bit 
pointers, everything below classid 400 is within the first page, which covers 
most of the "few classes" examples you can find lying around.

2) Script generated; in practice this is required if you have many classes. 
The classids will then usually be consecutive, at least in large chunks, 
which means minimal memory waste and an optimal case for plain linear 
lookup; hashing them can only be wasteful. (A sketch of the resulting page 
usage follows below.)
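
A sketch of where typical classids land, assuming 4k pages and 32-bit pointers
(1024 slots per second-level page); the script-generated range is a made-up
example:

#include <stdio.h>

#define SLOTS_PER_PAGE	(4096u / 4u)	/* 32-bit pointers */

int main(void)
{
	unsigned int manual[] = { 0x100, 0x101, 0x102, 0x200, 0x201 };

	for (int i = 0; i < 5; i++)
		printf("manual 1:%x -> page %u\n",
		       manual[i], manual[i] / SLOTS_PER_PAGE);

	/* script-generated: 1:1000 .. 1:17ff, 2048 consecutive classes */
	printf("script 1:1000..1:17ff -> pages %u..%u (2 pages, 8 kB)\n",
	       0x1000 / SLOTS_PER_PAGE, 0x17ff / SLOTS_PER_PAGE);
	return 0;
}

The manual style stays entirely within the first page, and big script-generated
blocks cost about one page per 1024 classes.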


 I have a patch for HFSC which introduces dynamic resizing of the
 class hash. I have planned to generalize it (similar to tcf_hashinfo)
 and convert HTB and CBQ as well, which as a nice side effect will
 allow to get rid of some duplicated code, like hash walking.

I have not looked into HFSC and CBQ, and was not aware that they have 
similar issues.


 If you give me a few days I'll try to finish and post it.

Memory is generally not an issue, but CPU is, and you can not beat the CPU 
efficiency of plain array lookup (always faster, and constant time).

If anything, I would find it more relevant to use array lookup with dynamic 
adjustment of the array size (HTB_CLS_ARRAY_SIZE in my patch); start out 
small to waste less memory, increase up to PAGE_SIZE as needed.

But then, it is probably too much effort for the gain (a few kb's in machines 
that should have plenty of RAM anyway), and requires more code = more 
complexity, bugs, maintenance.


Regards
Simon


[LARTC] [PATCH] HTB O(1) class lookup

2007-01-31 Thread Simon Lodal

This patch changes HTB's class storage from hash+lists to a two-level linear 
array, so it can do constant time (O(1)) class lookup by classid. It improves 
scalability for large number of classes.

Without the patch, ~14k htb classes can starve a Xeon-3.2 at only 15kpps, 
using most of its cycles traversing lists in htb_find(). The patch 
eliminates this problem, and has a measurable impact even with a few hundred 
classes.

Previously, scalability could be improved by increasing HTB_HSIZE, modifying the 
hash function, and recompiling, but this patch works for everyone without 
recompiling and scales better too.

The patch is for 2.6.20-rc6, I have older ones for 2.6.18 and 2.6.19 if anyone 
is interested.

Signed-off-by: Simon Lodal [EMAIL PROTECTED]
--- linux-2.6.20-rc6.base/net/sched/sch_htb.c	2007-01-25 03:19:28.0 +0100
+++ linux-2.6.20-rc6/net/sched/sch_htb.c	2007-02-01 05:44:36.0 +0100
@@ -68,16 +68,37 @@
 one less than their parent.
 */
 
-#define HTB_HSIZE 16		/* classid hash size */
-#define HTB_EWMAC 2		/* rate average over HTB_EWMAC*HTB_HSIZE sec */
-#define HTB_RATECM 1		/* whether to use rate computer */
+#define HTB_MAX_CLS		(TC_H_MIN(-1)+1)
+#define HTB_CLS_ARRAY_SIZE	PAGE_SIZE
+#define HTB_CLS_PER_ARRAY	(HTB_CLS_ARRAY_SIZE / sizeof(void*))
+#define HTB_CLS_ARRAYS		(HTB_MAX_CLS / HTB_CLS_PER_ARRAY)
+#define HTB_CLS_ARRAY(CID)	(CID / HTB_CLS_PER_ARRAY)
+#define HTB_CLS_INDEX(CID)	(CID & (HTB_CLS_PER_ARRAY-1))
+
+
 #define HTB_HYSTERESIS 1	/* whether to use mode hysteresis for speedup */
-#define HTB_VER 0x30011		/* major must be matched with number suplied by TC as version */
+#define HTB_VER 0x30012		/* major must be matched with number suplied by TC as version */
 
 #if HTB_VER >> 16 != TC_HTB_PROTOVER
 #error Mismatched sch_htb.c and pkt_sch.h
 #endif
 
+/* Whether to use rate computer. This is only used for statistics output to
+   userspace (tc -s class show dev ...); if you do not care about that and
+   want the last bit of performance, disable this. */
+#define HTB_RATECM
+#ifdef HTB_RATECM
+/* Time between rate computation updates, in seconds, for each class. */
+#define HTB_RATECM_UPDATE_INTERVAL	16
+/* How many HTB_RATECM_UPDATE_INTERVAL intervals to average over when
+   computing the rate. If set to 1, the computed rate will be exactly the
+   observed rate of the last interval. If set to higher values, the computed
+   rate will converge slower, but also vary less with small, temporary changes
+   in traffic.
+*/
+#define HTB_RATECM_AVERAGE_INTERVALS	2
+#endif
+
 /* used internaly to keep status of single class */
 enum htb_cmode {
 	HTB_CANT_SEND,		/* class can't send and can't borrow */
@@ -104,7 +125,7 @@
 	/* topology */
 	int level;		/* our level (see above) */
 	struct htb_class *parent;	/* parent class */
-	struct hlist_node hlist;	/* classid hash list item */
+	struct list_head clist;		/* classid list item */
 	struct list_head sibling;	/* sibling list item */
 	struct list_head children;	/* children list */
 
@@ -165,9 +186,13 @@
 	return rate->data[slot];
 }
 
+typedef struct htb_class* htb_cls_array[HTB_CLS_PER_ARRAY];
+
 struct htb_sched {
-	struct list_head root;	/* root classes list */
-	struct hlist_head hash[HTB_HSIZE];	/* hashed by classid */
+	struct list_head root;		/* root classes list */
+	struct list_head clist;		/* all classes list */
+	/* all classes arrays for fast lookup */
+	htb_cls_array* classes[HTB_CLS_ARRAYS];
 	struct list_head drops[TC_HTB_NUMPRIO];/* active leaves (for drops) */
 
 	/* self list - roots of self generating tree */
@@ -198,8 +223,10 @@
 	psched_time_t now;	/* cached dequeue time */
 	struct timer_list timer;	/* send delay timer */
 #ifdef HTB_RATECM
-	struct timer_list rttim;	/* rate computer timer */
-	int recmp_bucket;	/* which hash bucket to recompute next */
+	struct timer_list rttim;/* rate computer timer */
+	int clscnt;		/* total number of classes */
+	struct list_head *rtcur;/* current class to update rate timer for */
+	int rtiter;		/* current iteration (1..UPDATE_INTERVAL) */
 #endif
 
 	/* non shaped skbs; let them go directly thru */
@@ -209,32 +236,22 @@
 	long direct_pkts;
 };
 
-/* compute hash of size HTB_HSIZE for given handle */
-static inline int htb_hash(u32 h)
-{
-#if HTB_HSIZE != 16
-#error Declare new hash for your HTB_HSIZE
-#endif
-	h ^= h >> 8;		/* stolen from cbq_hash */
-	h ^= h >> 4;
-	return h & 0xf;
-}
-
-/* find class in global hash table using given handle */
+/* find class in class arrays using given handle */
 static inline struct htb_class *htb_find(u32 handle, struct Qdisc *sch)
 {
 	struct htb_sched *q = qdisc_priv(sch);
-	struct hlist_node *p;
-	struct htb_class *cl;
+	htb_cls_array* a;
+	int minorid;
 
 	if (TC_H_MAJ(handle) != sch->handle)
 		return NULL;
 
-	hlist_for_each_entry(cl, p, q->hash + htb_hash(handle), hlist) {
-		if (cl->classid == handle)
-			return cl;
-	}
-	return NULL;
+	minorid = TC_H_MIN(handle);
+	a = q->classes[HTB_CLS_ARRAY(minorid)];
+	if (a == NULL)
+		return

Re: [LARTC] how to setup massive traffic shaping? (2 class B nets)

2006-09-20 Thread Simon Lodal

If you use HTB, you need to compile it with HTB_HSIZE set to at least 256 (in 
sch_htb.c). Otherwise your CPU will be fully loaded even with a few kpps of 
traffic.
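
For example (an untested sketch of the edit; besides raising HTB_HSIZE you also
have to widen the hard-wired hash, and drop the "#if HTB_HSIZE != 16" guard that
rejects any other size):

/* in net/sched/sch_htb.c */
#define HTB_HSIZE 256		/* was 16 */

static inline int htb_hash(u32 h)
{
	/* fold the 16-bit minor classid down to 8 bits instead of 4 */
	h ^= h >> 8;
	return h & 0xff;
}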

The problem is how HTB stores the classes; it is not very efficient when there 
are thousands of them. I do not know if other qdiscs have the same problem.

I am working on a better patch for that, but it is not ready yet.


Regards,
Simon


On Tuesday 19 September 2006 13:42, Тимур Сафин wrote:
 Hello
 I have 2 class-B networks (172.22.0.0/16 and 172.23.0.0/16, over 130k
 of ip's) and need to setup
 traffic tbf shapers  with 64kb/s for each ip from 172.22.0.0/16 and
 128kb/s for each ip from 172.23.0.0/16
 just read lartc and don't understand how to use u32 for decreasing
 number of rules and hashing


Re: [LARTC] process id with firewall and tc

2006-09-16 Thread Simon Lodal

Routing, firewalling and shaping run in the kernel and have no pid. Instead you 
can get/set /proc flags, and check for the presence of certain data structures.

/proc/sys/net/ipv4/ip_forward is the routing master switch. If 0, the machine 
forwards nothing. You can both set and get the value, should be relatively 
easy from a web page. Beware that setting it to 1 may reset other /proc keys 
to default values.
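
For example, the backend of such a status page could read the flag like this (a
minimal standalone sketch, error handling trimmed); writing a 0 or 1 back
through the same file toggles forwarding:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/ip_forward", "r");
	int on = 0;

	if (f) {
		if (fscanf(f, "%d", &on) != 1)
			on = 0;
		fclose(f);
	}
	printf("forwarding: %s\n", on ? "active" : "disabled");
	return 0;
}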

For iptables firewalling you probably need to check whether your rules are loaded 
or not; a script parsing 'iptables -nL' output could do it. Or you could use 
a condition match enabled at the beginning of each table, and drop all 
traffic if the condition is false. The /proc/net/ipt_condition/enabled value 
can then be read and set as a master switch from the web page.

Shaping has no /proc files, and no way to create a master switch, so you need 
a script that parses 'tc qdisc show dev eth0' or 'tc class show dev eth0' 
output.


Regards,
Simon


On Saturday 16 September 2006 15:38, William Bohannan wrote:
 Not sure this is the correct place to post this but I am looking to have
 status of the firewall and traffic control (active, disabled, stopped etc)
 on a webpage controlled via something like pid as the machine has many
 things running on it, like firewall, traffic control, data collection for
 graphing the traffic flows, as well as other services like squid etc.  Any
 ideas would be most helpful.



 Kind Regards



 William


Re: [LARTC] Performance problem on a linux bridge used for shaping.

2006-09-16 Thread Simon Lodal

I have similar hardware, load and trouble.

Interrupts are only sent to one CPU, instead of all of them, because spreading 
them was mostly just overhead. I think the default was changed somewhere around 
2.6.10 or .12, but I have forgotten the url.

There is a CONFIG_IRQBALANCE option in the kernel, but last time I checked 
(2.6.16) it did not work very well; it almost never does anything. So I have 
turned it off. I use the userspace irqbalance daemon that periodically sets 
the smp_affinity masks, with the effect that incoming and outgoing traffic are 
handled by separate CPUs (assuming that no other interrupts or processes are 
significant). That helps some. But we only shape in one direction, and it can 
not help spread the shaping load between the CPUs.
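
What the daemon effectively does can also be done by hand, by writing hex CPU
masks to /proc/irq/<n>/smp_affinity. A standalone sketch; the IRQ numbers 24 and
25 are placeholders, take the real ones from /proc/interrupts (a plain echo from
the shell works just as well):

#include <stdio.h>

static void pin_irq(int irq, const char *cpumask_hex)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return;
	}
	fprintf(f, "%s\n", cpumask_hex);
	fclose(f);
}

int main(void)
{
	pin_irq(24, "1");	/* e.g. eth0 -> CPU0 */
	pin_irq(25, "2");	/* e.g. eth1 -> CPU1 */
	return 0;
}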

There is also an acpi_irq_balance kernel parameter (not related to the kernel 
irq balancer), which apparently uses the APIC to do interrupt round-robin. It 
worked surprisingly well (perfectly, actually) on an old dual Celeron that I 
tested; the network interrupts were spread nicely and evenly across the CPUs. 
It is probably very chipset dependent, and I have not yet tested it on the 
firewalls.

But as I understand from this:
http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2006/09/14#netconf2006_day2
it might not even be an advantage, since the current shaping code would just 
make the CPUs step on each other's toes.


Regards,
Simon


On Saturday 16 September 2006 20:39, Alexandru Dragoi wrote:
 Hello,

 Here is the situation. There is a machine with 3 intel gigabit card, 2
 of them on PCI-X and in bridge, the 3rd is used only for management
 access. The machine is a dual Xeon 2.8GHz with HT. With 2.6.8 kernel
 from debian (testing) and htb with u32 on, i usually get about 30-40%
 software interrupts on CPU0 and CPU2, and without htb and u32, 10% less.
 Now, if I boot with 2.6.17.9 kernel, first all irqs are on same CPU. I
 managed with smp_affinity to move the irq of one card to a different CPU.
 In these circumstances, I have about 20% or a little less soft
 interrupts on each CPU without shaping, but about 60-70% of soft
 interrupts with shaping, and sometimes there is packet loss; also
 dropped packets are shown on ifconfig. The htb script is same. I have
 u32 performance counters enabled on u32 in 2.6.17.9. I also have NAPI,
 of course. I can't think on anything else that can cause the problem,
 but seem to be something in the kernel. Here is the output of lspci:

 # lspci
 :00:00.0 Host bridge: Intel Corporation E7320 Memory Controller Hub
 (rev 0a)
 :00:00.1 ff00: Intel Corporation E7320 Error Reporting Registers
 (rev 0a)
 :00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express
 Port A (rev 0a)
 :00:03.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express
 Port A1 (rev 0a)
 :00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge
 (rev 02)
 :00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal
 Host Controller (rev 02)
 :00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal
 Host Controller (rev 02)
 :00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer
 (rev 02)
 :00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable
 Interrupt Controller (rev 02)
 :00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced
 Host Controller (rev 02)
 :00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
 :00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface
 Controller (rev 02)
 :00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage
 Controller (rev 02)
 :00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
 :03:02.0 Ethernet controller: Intel Corporation 82544EI Gigabit
 Ethernet Controller (Copper) (rev 02)
 :03:03.0 Ethernet controller: Intel Corporation 82544EI Gigabit
 Ethernet Controller (Copper) (rev 02)
 :04:02.0 VGA compatible controller: ATI Technologies Inc Rage XL
 (rev 27)
 :04:03.0 Ethernet controller: Intel Corporation 82541GI Gigabit
 Ethernet Controller (rev 05)

 The traffic is somewhere at 40kpps of traffic and 120mbit up,
 120mbit/down. Suggestions about better hardware and kernel, or links
 with docs about these are really welcomed.

 Bye



[LARTC] Large number of HTB classes

2004-08-27 Thread Simon Lodal
I am planning a setup with thousands of classes in an HTB qdisc, say from 
1:1000 to 1:2000, each with a very small rate and a big ceil, for fair 
sharing of a 45mbit link.

I suspect some problems could be lurking in there.
Anyone have good/bad experience with such a number of classes?
Simon


Re: [LARTC] Large number of HTB classes

2004-08-27 Thread Simon Lodal
HFSC seems interesting, but does it really work well? I am not afraid of 
new stuff, but bleeding edge is probably too risky.

That HTB problem, I guess you mean it is possible to have available 
bandwidth, but when some of it has been distributed between users, all 
users' ceils go below their quantums?

But that implies that users have low ceils? Wouldn't it be solved by 
setting all users' ceil to the full link bandwidth?

The doubly linked list patch you mention - I believe it has been in the vanilla 
kernel since 2.4.20-22 or so? I use 2.4.27 here. If not, do you have a link 
for it?

Simon
Tomasz Paszkowski wrote:
On Fri, Aug 27, 2004 at 10:46:59AM +0200, Simon Lodal wrote:
I am planning a setup with thousands of classes in a HTB qdisc, say from 
1:1000 to 1:2000, each with a very small rate and a big ceil, for fair 
sharing of a 45mbit link.

Consider using HFSC. HTB is not the best solution for such a number of classes
with small rates. Users will not be able to get the whole ceil even if
there is available bandwidth.
Secondly, make sure you're using a kernel with the doubly linked list patch
applied to the qdisc API.