Re: [PATCH] [REVISED] net/ipv4/multipath_wrandom.c: check kmalloc() return value.
On Mon, Mar 12, 2007 at 01:56:13PM -0700, David Miller wrote:
> From: Pekka J Enberg [EMAIL PROTECTED]
> Date: Mon, 12 Mar 2007 14:15:16 +0200 (EET)
> > On 3/9/07, David Miller [EMAIL PROTECTED] wrote:
> > > The whole cache-multipath subsystem has to have its guts revamped
> > > for proper error handling. (Untested patch follows.)
> I'm not accepting untested patches. If people want to fix this code, it
> cannot be done in a shoddy manner any more; it will need to be fixed
> and tested by people who really care about this code.

Hi Dave,
We do care for the multipath cached option. We use it often with node
aware multipath. We haven't experienced oopsen or crashes in our use
cases over the past year. We will test Amit and Pekka's patch with fault
injection. We will also go through the core multipath-cached code and
give a try at cleaning it up.

> We need real care for this code or I will remove it in 2.6.23

Can you please point us to any open bug reports on multipath cached? We
did a Call for testing on netdev some time back but did not get any bug
reports: http://marc.info/?l=linux-netdev&m=114827370231872&w=2
The kernel bugzilla shows zaroo boogs for multipath cached as well.

Thanks,
Kiran
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] add dev_to_node()
On Sun, Nov 05, 2006 at 12:53:23AM +0100, Christoph Hellwig wrote:
> On Sat, Nov 04, 2006 at 06:06:48PM -0500, Dave Jones wrote:
> > On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:
> > This will break the compile for !NUMA if someone ends up doing a
> > bisect and lands here as a bisect point. You introduce this nice
> > wrapper..
> The dev_to_node wrapper is not enough, as we can't assign to (-1) for
> the non-NUMA case. So I added a second macro, set_dev_node, for that.
> The patch below compiles and works on NUMA and non-NUMA platforms.

Hi Christoph,
dev_to_node does not work as expected on x86_64 (and i386). This is
because the node value returned by pcibus_to_node is initialized after a
struct device is created with the current x86_64 code. We need the node
value initialized before the call to pci_scan_bus_parented, as the
generic devices are allocated and initialized off pci_scan_child_bus,
which gets called from pci_scan_bus_parented. The following patch does
that using the pci_sysdata introduced by the PCI domain patches in -mm.

Signed-off-by: Alok N Kataria [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.19-rc4mm2/arch/i386/pci/acpi.c
===================================================================
--- linux-2.6.19-rc4mm2.orig/arch/i386/pci/acpi.c	2006-11-06 11:03:50.000000000 -0800
+++ linux-2.6.19-rc4mm2/arch/i386/pci/acpi.c	2006-11-06 22:04:14.000000000 -0800
@@ -9,6 +9,7 @@ struct pci_bus * __devinit pci_acpi_scan
 {
 	struct pci_bus *bus;
 	struct pci_sysdata *sd;
+	int pxm;
 
 	/* Allocate per-root-bus (not per bus) arch-specific data.
 	 * TODO: leak; this memory is never freed.
@@ -30,15 +31,21 @@ struct pci_bus * __devinit pci_acpi_scan
 }
 #endif /* CONFIG_PCI_DOMAINS */
 
+	sd->node = -1;
+
+	pxm = acpi_get_pxm(device->handle);
+#ifdef CONFIG_ACPI_NUMA
+	if (pxm >= 0)
+		sd->node = pxm_to_node(pxm);
+#endif
+
 	bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
 	if (!bus)
 		kfree(sd);
 
 #ifdef CONFIG_ACPI_NUMA
 	if (bus != NULL) {
-		int pxm = acpi_get_pxm(device->handle);
 		if (pxm >= 0) {
-			sd->node = pxm_to_node(pxm);
 			printk("bus %d -> pxm %d -> node %d\n",
			       busnum, pxm, sd->node);
 		}
[CFT] CONFIG_IP_ROUTE_MULTIPATH_CACHED [was Re: [PATCH] Disable multipath cache routing ]
On Sat, May 20, 2006 at 01:51:45PM -0700, Chris Wedgwood wrote:
> ... Anyhow, the code as-is hasn't been maintained for a long time
> except for a few minor blips (I'm using hg's annotate to find those and
> have included those people on the cc list as presumably they are using
> these features and might have useful input).

We are interested in keeping CONFIG_IP_ROUTE_MULTIPATH_CACHED, as this
enables us to do node aware multipathing:
http://lkml.org/lkml/2006/3/22/13
Hence, we are also interested in stabilizing MULTIPATH_CACHED. Please
send bug reports with the test case scenario to myself and Pravin at
[EMAIL PROTECTED] and [EMAIL PROTECTED]

Thanks,
Kiran
[patch 1/2] net: Node aware multipath device round robin
The following patch adds node aware, device round robin IP
multipathing. It is based on, and derived from, multipath_drr.c, the
multipath device round robin algorithm. This implementation maintains a
per-node state table and round robins between interfaces on the same
node. The implementation needs to be aware of the NIC's proximity to a
node, hence we have added a node id field to struct net_device. NIC
device drivers can initialize this with the node id the NIC belongs to.
This patch uses the IP_MP_ALG_DRR slot like the regular multipath_drr,
so either SMP multipath_drr or node aware multipath_node_drr should be
used for device round robin, based on whether the system has proximity
information for the NICs.

Performance results:
1. Single NIC test -- 1 client targets 1 NIC on the server with 300
   concurrent requests.
2. 4 NIC test -- 1 client targets 4 NICs, all on different nodes on the
   server, with 300 concurrent requests.

We see about a 135% improvement in AB requests per second with this
patch and the device_locality_check patch on the single NIC test, on the
Rackable C5100 machine (server). We see about a 64% improvement when all
4 NICs are targeted.

Credits: This work was originally done by Justin Forbes.

Comments?

Signed-off by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off by: Shobhit Dayal [EMAIL PROTECTED]
Signed-off by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6.16.orig/drivers/net/e1000/e1000_main.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/drivers/net/e1000/e1000_main.c	2006-03-20 14:52:23.000000000 -0800
@@ -692,6 +692,7 @@ e1000_probe(struct pci_dev *pdev,
 	SET_MODULE_OWNER(netdev);
 	SET_NETDEV_DEV(netdev, &pdev->dev);
+	SET_NETDEV_NODE(netdev, pcibus_to_node(pdev->bus));
 
 	pci_set_drvdata(pdev, netdev);
 	adapter = netdev_priv(netdev);
Index: linux-2.6.16/drivers/net/tg3.c
===================================================================
--- linux-2.6.16.orig/drivers/net/tg3.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/drivers/net/tg3.c	2006-03-20 14:52:23.000000000 -0800
@@ -10705,6 +10705,7 @@ static int __devinit tg3_init_one(struct
 	SET_MODULE_OWNER(dev);
 	SET_NETDEV_DEV(dev, &pdev->dev);
+	SET_NETDEV_NODE(dev, pcibus_to_node(pdev->bus));
 
 	dev->features |= NETIF_F_LLTX;
 #if TG3_VLAN_TAG_USED
Index: linux-2.6.16/include/linux/netdevice.h
===================================================================
--- linux-2.6.16.orig/include/linux/netdevice.h	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/include/linux/netdevice.h	2006-03-20 14:52:23.000000000 -0800
@@ -315,7 +315,9 @@ struct net_device
 	/* Interface index. Unique device identifier */
 	int			ifindex;
 	int			iflink;
-
+#ifdef CONFIG_NUMA
+	int			node;	/* NUMA node this IF is close to */
+#endif
 	struct net_device_stats* (*get_stats)(struct net_device *dev);
 	struct iw_statistics*	(*get_wireless_stats)(struct net_device *dev);
 
@@ -520,6 +522,14 @@ static inline void *netdev_priv(struct n
  */
 #define SET_NETDEV_DEV(net, pdev)	((net)->class_dev.dev = (pdev))
 
+#ifdef CONFIG_NUMA
+#define SET_NETDEV_NODE(dev, nodeid)	((dev)->node = (nodeid))
+#define netdev_node(dev)		((dev)->node)
+#else
+#define SET_NETDEV_NODE(dev, nodeid)	do { } while (0)
+#define netdev_node(dev)		(-1)
+#endif
+
 struct packet_type {
 	__be16			type;	/* This is really htons(ether_type). */
 	struct net_device	*dev;	/* NULL is wildcarded here */
Index: linux-2.6.16/net/core/dev.c
===================================================================
--- linux-2.6.16.orig/net/core/dev.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/net/core/dev.c	2006-03-20 14:52:23.000000000 -0800
@@ -3003,7 +3003,8 @@ struct net_device *alloc_netdev(int size
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
-
+
+	SET_NETDEV_NODE(dev, -1);
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
Index: linux-2.6.16/net/ipv4/Kconfig
===================================================================
--- linux-2.6.16.orig/net/ipv4/Kconfig	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/net/ipv4/Kconfig	2006-03-20 14:52:23.000000000 -0800
@@ -164,6 +164,15 @@ config IP_ROUTE_MULTIPATH_DRR
 	  available interfaces. This policy makes sense if the connections
 	  should be primarily distributed on interfaces and not on routes.
 
+config IP_ROUTE_MULTIPATH_NODE
+	tristate "MULTIPATH: interface RR algorithm with node affinity"
+	depends on IP_ROUTE_MULTIPATH_CACHED && NUMA && !IP_ROUTE_MULTIPATH_DRR
+	help
+	  This
[patch 2/2] net: Node aware multipath device round robin -- device locality check
This patch checks device locality on every IP packet xmit. In a
multipath configuration, the TCP connection to route association is done
at session startup time, and the TCP session's process may be migrated
to a different node after this association. This would mean a remote NIC
is chosen for xmit although a local NIC could be available; this leads
to remote NIC transfers in some TCP workloads such as AB. The following
patch checks if a local NIC is available for the destination, and
recalculates routes if so.

Downside: adds a bitmap to struct rtable, but only if
CONFIG_IP_ROUTE_MULTIPATH_NODE is enabled.

Comments, suggestions welcome.

Signed-off by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16/include/net/route.h
===================================================================
--- linux-2.6.16.orig/include/net/route.h	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/include/net/route.h	2006-03-20 14:52:24.000000000 -0800
@@ -75,6 +75,13 @@ struct rtable
 	/* Miscellaneous cached information */
 	__u32			rt_spec_dst; /* RFC1122 specific destination */
 	struct inet_peer	*peer;	/* long-living peer info */
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+	/* bitmap bit is set if current node has a local multi-path device
+	 * for this route.
+	 */
+	DECLARE_BITMAP(mp_if_bitmap, MAX_NUMNODES);
+#endif
+
 };
 
 struct ip_rt_acct
@@ -201,4 +208,21 @@ static inline struct inet_peer *rt_get_p
 extern ctl_table ipv4_route_table[];
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+
+#include <linux/ip_mp_alg.h>
+
+static inline int dst_dev_node_check(struct rtable *rt)
+{
+	int cnode = numa_node_id();
+	if (unlikely(netdev_node(rt->u.dst.dev) != cnode)) {
+		if (test_bit(cnode, rt->mp_if_bitmap))
+			return 1;
+	}
+	return 0;
+}
+#else
+#define dst_dev_node_check(rt) 0
+#endif
+
 #endif	/* _ROUTE_H */
Index: linux-2.6.16/net/ipv4/ip_output.c
===================================================================
--- linux-2.6.16.orig/net/ipv4/ip_output.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/net/ipv4/ip_output.c	2006-03-20 14:52:24.000000000 -0800
@@ -309,7 +309,7 @@ int ip_queue_xmit(struct sk_buff *skb, i
 	/* Make sure we can route this packet. */
 	rt = (struct rtable *)__sk_dst_check(sk, 0);
-	if (rt == NULL) {
+	if ((rt == NULL) || dst_dev_node_check(rt)) {
 		u32 daddr;
 
 		/* Use correct destination address if we have options. */
Index: linux-2.6.16/net/ipv4/route.c
===================================================================
--- linux-2.6.16.orig/net/ipv4/route.c	2006-03-19 21:53:29.000000000 -0800
+++ linux-2.6.16/net/ipv4/route.c	2006-03-20 14:52:24.000000000 -0800
@@ -2313,6 +2313,22 @@ static inline int ip_mkroute_output(stru
 	if (res->fi && res->fi->fib_nhs > 1) {
 		unsigned char hopcount = res->fi->fib_nhs;
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+		DECLARE_BITMAP(mp_if_bitmap, MAX_NUMNODES);
+
+		bitmap_zero(mp_if_bitmap, MAX_NUMNODES);
+		/* Calculating device bitmap for this multipath route */
+		if (res->fi->fib_mp_alg == IP_MP_ALG_DRR) {
+			for (hop = 0; hop < hopcount; hop++) {
+				struct net_device *dev2nexthop;
+
+				res->nh_sel = hop;
+				dev2nexthop = FIB_RES_DEV(*res);
+				dev_hold(dev2nexthop);
+				set_bit(netdev_node(dev2nexthop), mp_if_bitmap);
+				dev_put(dev2nexthop);
+			}
+		}
+#endif
 		for (hop = 0; hop < hopcount; hop++) {
 			struct net_device *dev2nexthop;
@@ -2343,6 +2359,10 @@ static inline int ip_mkroute_output(stru
 					FIB_RES_NETMASK(*res),
 					res->prefixlen,
 					FIB_RES_NH(*res));
+
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+			bitmap_copy(rth->mp_if_bitmap, mp_if_bitmap, MAX_NUMNODES);
+#endif
 cleanup:
 			/* release work reference to output device */
 			dev_put(dev2nexthop);
Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
On Wed, Mar 08, 2006 at 04:32:58PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:
> > On Wed, Mar 08, 2006 at 03:43:21PM -0800, Andrew Morton wrote:
> > > Benjamin LaHaise [EMAIL PROTECTED] wrote:
> > > > I think it may make more sense to simply convert local_t into a
> > > > long, given that most of the users will be things like stats
> > > > counters.
> > > Yes, I agree that making local_t signed would be better. It's
> > > consistent with atomic_t, atomic64_t and atomic_long_t and it's a
> > > bit more flexible.
> Perhaps. A lot of applications would just be upcounters for statistics,
> where unsigned is desired. But I think the consistency argument wins
> out.

It already is... for most of the arches except x86_64; x86 uses
unsigned long. Here's a patch making x86_64 local_t 64 bits like the
other 64-bit arches. This keeps local_t unsigned long. (We can change it
to a signed value along with the other arches later in one go, I guess)

Thanks,
Kiran

Change x86_64 local_t to 64 bits, like all other arches.

Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]

Index: linux-2.6.16-rc5mm3/include/asm-x86_64/local.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/asm-x86_64/local.h	2006-03-08 16:51:31.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/asm-x86_64/local.h	2006-03-08 21:56:01.000000000 -0800
@@ -5,18 +5,18 @@
 
 typedef struct
 {
-	volatile unsigned int counter;
+	volatile long counter;
 } local_t;
 
 #define LOCAL_INIT(i)	{ (i) }
 
-#define local_read(v)	((v)->counter)
+#define local_read(v)	((unsigned long)(v)->counter)
 #define local_set(v,i)	(((v)->counter) = (i))
 
 static __inline__ void local_inc(local_t *v)
 {
 	__asm__ __volatile__(
-		"incl %0"
+		"incq %0"
 		:"=m" (v->counter)
 		:"m" (v->counter));
 }
 
@@ -24,7 +24,7 @@ static __inline__ void local_inc(local_t
 static __inline__ void local_dec(local_t *v)
 {
 	__asm__ __volatile__(
-		"decl %0"
+		"decq %0"
 		:"=m" (v->counter)
 		:"m" (v->counter));
 }
 
@@ -32,7 +32,7 @@ static __inline__ void local_dec(local_t
 static __inline__ void local_add(unsigned int i, local_t *v)
 {
 	__asm__ __volatile__(
-		"addl %1,%0"
+		"addq %1,%0"
 		:"=m" (v->counter)
 		:"ir" (i), "m" (v->counter));
 }
 
@@ -40,7 +40,7 @@ static __inline__ void local_add(unsigne
 static __inline__ void local_sub(unsigned int i, local_t *v)
 {
 	__asm__ __volatile__(
-		"subl %1,%0"
+		"subq %1,%0"
 		:"=m" (v->counter)
 		:"ir" (i), "m" (v->counter));
 }
 
@@ -71,4 +71,4 @@ static __inline__ void local_sub(unsigne
 #define __cpu_local_add(i, v)	cpu_local_add((i), (v))
 #define __cpu_local_sub(i, v)	cpu_local_sub((i), (v))
 
-#endif /* _ARCH_I386_LOCAL_H */
+#endif /* _ARCH_X8664_LOCAL_H */
Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
On Thu, Mar 09, 2006 at 07:14:26PM +1100, Nick Piggin wrote:
> Ravikiran G Thirumalai wrote:
> > Here's a patch making x86_64 local_t 64 bits like other 64-bit
> > arches. This keeps local_t unsigned long. (We can change it to a
> > signed value along with the other arches later in one go, I guess)
> Why not just keep the naming and structure of interfaces consistent
> with atomic_t? That would be signed and 32-bit. You then also have a
> local64_t.

No, local_t is supposed to be 64 bits on 64-bit arches and 32 bits on
32-bit arches. x86_64 was the only exception, so this patch fixes that.
Re: [patch 3/4] net: percpufy frequently used vars -- proto.sockets_allocated
On Tue, Mar 07, 2006 at 06:16:02PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:
> > +static inline int read_sockets_allocated(struct proto *prot)
> > +{
> > +	int total = 0;
> > +	int cpu;
> > +	for_each_cpu(cpu)
> > +		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
> > +	return total;
> > +}
> This is likely too big to be inlined, plus sock.h doesn't include
> enough headers to reliably compile this code.
> > +static inline void mod_sockets_allocated(int *sockets_allocated, int count)
> > +{
> > +	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
> > +	put_cpu();
> > +}
> Ditto.

OK, here is a revised patch.

Change the atomic_t sockets_allocated member of struct proto to a
per-cpu counter.

Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h	2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h	2006-03-08 12:06:52.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
 	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
-	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+	int			*sockets_allocated;	/* Current number of sockets (percpu counter). */
 
 	/*
 	 * Pressure flag: try to collapse.
@@ -579,6 +579,12 @@ struct proto {
 	} stats[NR_CPUS];
 };
 
+extern int read_sockets_allocated(struct proto *prot);
+extern void mod_sockets_allocated(int *sockets_allocated, int count);
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
 extern int proto_register(struct proto *prot, int alloc_slab);
 extern void proto_unregister(struct proto *prot);
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h	2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h	2006-03-08 11:39:40.000000000 -0800
@@ -226,7 +226,7 @@ extern int sysctl_tcp_mtu_probing;
 extern int sysctl_tcp_base_mss;
 
 extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
 /*
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c	2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c	2006-03-08 12:09:19.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock
 		newsk->sk_sleep = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
-			atomic_inc(newsk->sk_prot->sockets_allocated);
+			inc_sockets_allocated(newsk->sk_prot->sockets_allocated);
 	}
 out:
 	return newsk;
@@ -1451,6 +1451,25 @@ void sk_common_release(struct sock *sk)
 EXPORT_SYMBOL(sk_common_release);
 
+int read_sockets_allocated(struct proto *prot)
+{
+	int total = 0;
+	int cpu;
+	for_each_cpu(cpu)
+		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+	return total;
+}
+
+EXPORT_SYMBOL(read_sockets_allocated);
+
+void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+	put_cpu();
+}
+
+EXPORT_SYMBOL(mod_sockets_allocated);
+
 static DEFINE_RWLOCK(proto_list_lock);
 static LIST_HEAD(proto_list);
 
@@ -1620,7 +1639,7 @@ static void proto_seq_printf(struct seq_
 		   "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
-		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
+		   proto->sockets_allocated != NULL ? read_sockets_allocated(proto) : -1,
 		   proto->memory_allocated != NULL ? percpu_counter_read_positive(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c	2006-03-08 11:08:28.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c	2006-03-08 11:39:40.000000000 -0800
@@ -242,7 +242,7 @@ int sk_stream_mem_schedule(struct sock *
 		return 1;
 
 	if (!*sk->sk_prot->memory_pressure ||
-	    sk->sk_prot->sysctl_mem[2
Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated
On Tue, Mar 07, 2006 at 07:22:34PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:
> > The problem is percpu_counter_sum has to read all the cpus'
> > cachelines. If we have to use percpu_counter_sum everywhere, then we
> > might as well use plain per-cpu counters instead of batching ones,
> > no?
> I didn't say use it everywhere ;)

:) No you did not. Sorry. When the counter shows large variance it will
affect all places where it is read, and it wasn't clear to me where to
use percpu_counter_sum vs percpu_counter_read in the context of the
patch. But yes, using percpu_counter_sum at the critical points makes
things better. Thanks for pointing it out. Here is an attempt on those
lines.

> > Maybe, on very large cpu counts, we should just change FBC_BATCH so
> > that variance does not go quadratic. Something like 32, so that
> > variance is 32 * NR_CPUS in that case, instead of
> > (NR_CPUS * NR_CPUS * 2) currently. Comments?
> Sure, we need to make that happen. But it got all mixed up with the
> spinlock removal and it does need quite some thought and testing and
> documentation to help developers to choose the right settings and
> appropriate selection of defaults, etc.

I was thinking along the lines of something like the following as a
simple fix for the quadratic variance. But yes, it needs some
experimentation beforehand..

#if NR_CPUS >= 16
#define FBC_BATCH	(32)
#else
#define FBC_BATCH	(NR_CPUS*4)
#endif

Thanks,
Kiran

Change struct proto->memory_allocated to a batching per-CPU counter
(percpu_counter) from an atomic_t. A batching counter is better than a
plain per-CPU counter as this field is read often.

Changes since last post: Use the more accurate percpu_counter_sum() at
critical places.

Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h	2006-03-08 11:03:19.000000000 -0800
@@ -48,6 +48,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/percpu_counter.h>
 
 #include <linux/filter.h>
 
@@ -541,8 +542,9 @@ struct proto {
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
-	atomic_t		*memory_allocated;	/* Current allocated memory. */
+	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
 	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+
 	/*
 	 * Pressure flag: try to collapse.
	 * Technical note: it is used by multiple contexts non atomically.
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h	2006-03-08 11:03:19.000000000 -0800
@@ -225,7 +225,7 @@ extern int sysctl_tcp_abc;
 extern int sysctl_tcp_mtu_probing;
 extern int sysctl_tcp_base_mss;
 
-extern atomic_t tcp_memory_allocated;
+extern struct percpu_counter tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
 extern int tcp_memory_pressure;
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c	2006-03-08 11:03:19.000000000 -0800
@@ -1616,12 +1616,13 @@ static char proto_method_implemented(con
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
-	seq_printf(seq, "%-9s %4u %6d %6d %-3s %6u %-3s %-10s "
+	seq_printf(seq, "%-9s %4u %6d %6lu %-3s %6u %-3s %-10s "
 		   "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
-		   proto->memory_allocated != NULL ? atomic_read(proto->memory_allocated) : -1,
+		   proto->memory_allocated != NULL ?
+			percpu_counter_read_positive(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c	2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c	2006-03-08 11:08
Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
On Wed, Mar 08, 2006 at 04:17:33PM -0500, Benjamin LaHaise wrote:
> On Wed, Mar 08, 2006 at 01:07:26PM -0800, Ravikiran G Thirumalai wrote:
> Last time I checked, all the major architectures had efficient local_t
> implementations. Most of the RISC CPUs are able to do a load / store
> conditional implementation that is the same cost (since memory barriers
> tend to be explicit on powerpc). So why not use it?

Then, for the batched percpu_counters, we could gain by using local_t
only for the UP case. But we would have to have a new local_long_t
implementation for that. Do you think just one use case of local_long_t
warrants a new set of APIs?

Kiran
[patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
Add percpu_counter_mod_bh for using these counters safely from both
softirq and process context.

Signed-off by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off by: Ravikiran G Thirumalai [EMAIL PROTECTED]
Signed-off by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16-rc5mm3/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/linux/percpu_counter.h	2006-03-07 15:08:00.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/linux/percpu_counter.h	2006-03-07 15:09:21.000000000 -0800
@@ -11,6 +11,7 @@
 #include <linux/smp.h>
 #include <linux/threads.h>
 #include <linux/percpu.h>
+#include <linux/interrupt.h>
 
 #ifdef CONFIG_SMP
@@ -110,4 +111,11 @@ static inline void percpu_counter_dec(st
 	percpu_counter_mod(fbc, -1);
 }
 
+static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
+{
+	local_bh_disable();
+	percpu_counter_mod(fbc, amount);
+	local_bh_enable();
+}
+
 #endif /* _LINUX_PERCPU_COUNTER_H */
[patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated
Change struct proto->memory_allocated to a batching per-CPU counter
(percpu_counter) from an atomic_t. A batching counter is better than a
plain per-CPU counter as this field is read often.

Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h	2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h	2006-03-07 15:11:48.000000000 -0800
@@ -48,6 +48,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/percpu_counter.h>
 
 #include <linux/filter.h>
 
@@ -541,8 +542,9 @@ struct proto {
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
-	atomic_t		*memory_allocated;	/* Current allocated memory. */
+	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
 	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+
 	/*
 	 * Pressure flag: try to collapse.
	 * Technical note: it is used by multiple contexts non atomically.
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h	2006-03-07 15:08:00.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h	2006-03-07 15:11:48.000000000 -0800
@@ -225,7 +225,7 @@ extern int sysctl_tcp_abc;
 extern int sysctl_tcp_mtu_probing;
 extern int sysctl_tcp_base_mss;
 
-extern atomic_t tcp_memory_allocated;
+extern struct percpu_counter tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
 extern int tcp_memory_pressure;
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c	2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c	2006-03-07 15:11:48.000000000 -0800
@@ -1616,12 +1616,13 @@ static char proto_method_implemented(con
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
-	seq_printf(seq, "%-9s %4u %6d %6d %-3s %6u %-3s %-10s "
+	seq_printf(seq, "%-9s %4u %6d %6lu %-3s %6u %-3s %-10s "
 		   "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
-		   proto->memory_allocated != NULL ? atomic_read(proto->memory_allocated) : -1,
+		   proto->memory_allocated != NULL ?
+			percpu_counter_read_positive(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c	2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c	2006-03-07 15:11:48.000000000 -0800
@@ -196,11 +196,11 @@ EXPORT_SYMBOL(sk_stream_error);
 void __sk_stream_mem_reclaim(struct sock *sk)
 {
 	if (sk->sk_forward_alloc >= SK_STREAM_MEM_QUANTUM) {
-		atomic_sub(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM,
-			   sk->sk_prot->memory_allocated);
+		percpu_counter_mod_bh(sk->sk_prot->memory_allocated,
+			-(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM));
 		sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;
 		if (*sk->sk_prot->memory_pressure &&
-		    (atomic_read(sk->sk_prot->memory_allocated) <
+		    (percpu_counter_read(sk->sk_prot->memory_allocated) <
 		     sk->sk_prot->sysctl_mem[0]))
 			*sk->sk_prot->memory_pressure = 0;
 	}
@@ -213,23 +213,26 @@ int sk_stream_mem_schedule(struct sock *
 	int amt = sk_stream_pages(size);
 
 	sk->sk_forward_alloc += amt * SK_STREAM_MEM_QUANTUM;
-	atomic_add(amt, sk->sk_prot->memory_allocated);
+	percpu_counter_mod_bh(sk->sk_prot->memory_allocated, amt);
 
 	/* Under limit. */
-	if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
+	if (percpu_counter_read(sk->sk_prot->memory_allocated) <
+	    sk->sk_prot->sysctl_mem[0]) {
 		if (*sk->sk_prot->memory_pressure)
 			*sk->sk_prot->memory_pressure = 0;
 		return 1;
 	}
 
 	/* Over hard limit. */
-	if (atomic_read(sk->sk_prot->memory_allocated)
[patch 3/4] net: percpufy frequently used vars -- proto.sockets_allocated
Change the atomic_t sockets_allocated member of struct proto to a
per-cpu counter.

Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h	2006-03-07 15:09:52.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
 	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
-	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+	int			*sockets_allocated;	/* Current number of sockets (percpu counter). */
 
 	/*
 	 * Pressure flag: try to collapse.
@@ -579,6 +579,24 @@ struct proto {
 	} stats[NR_CPUS];
 };
 
+static inline int read_sockets_allocated(struct proto *prot)
+{
+	int total = 0;
+	int cpu;
+	for_each_cpu(cpu)
+		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+	return total;
+}
+
+static inline void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+	put_cpu();
+}
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
 extern int proto_register(struct proto *prot, int alloc_slab);
 extern void proto_unregister(struct proto *prot);
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h	2006-03-07 15:09:52.000000000 -0800
@@ -226,7 +226,7 @@ extern int sysctl_tcp_mtu_probing;
 extern int sysctl_tcp_base_mss;
 
 extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
 /*
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c	2006-03-07 15:09:52.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock
 		newsk->sk_sleep = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
-			atomic_inc(newsk->sk_prot->sockets_allocated);
+			inc_sockets_allocated(newsk->sk_prot->sockets_allocated);
 	}
 out:
 	return newsk;
@@ -1620,7 +1620,7 @@ static void proto_seq_printf(struct seq_
 		   "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
-		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
+		   proto->sockets_allocated != NULL ? read_sockets_allocated(proto) : -1,
 		   proto->memory_allocated != NULL ? percpu_counter_read_positive(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c	2006-03-07 15:09:52.000000000 -0800
@@ -242,7 +242,7 @@ int sk_stream_mem_schedule(struct sock *
 		return 1;
 
 	if (!*sk->sk_prot->memory_pressure ||
-	    sk->sk_prot->sysctl_mem[2] > atomic_read(sk->sk_prot->sockets_allocated) *
+	    sk->sk_prot->sysctl_mem[2] > read_sockets_allocated(sk->sk_prot) *
 		sk_stream_pages(sk->sk_wmem_queued +
 				atomic_read(&sk->sk_rmem_alloc) +
 				sk->sk_forward_alloc))
Index: linux-2.6.16-rc5mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/proc.c	2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/proc.c	2006-03-07 15:09:52.000000000 -0800
@@ -63,7 +63,7 @@ static int sockstat_seq_show(struct seq_
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
 		   fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
-		   tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
+		   tcp_death_row.tw_count, read_sockets_allocated(&tcp_prot),
Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated
On Tue, Mar 07, 2006 at 06:14:22PM -0800, Andrew Morton wrote:
 Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:

 -	if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
 +	if (percpu_counter_read(sk->sk_prot->memory_allocated) <
 +	    sk->sk_prot->sysctl_mem[0]) {

 Bear in mind that percpu_counter_read[_positive] can be inaccurate on large CPU counts. It might be worth running percpu_counter_sum() to get the exact count if we think we're about to cause something to fail.

The problem is percpu_counter_sum has to read all the cpus' cachelines. If we have to use percpu_counter_sum everywhere, then we might as well use plain per-cpu counters instead of batching ones, no? sysctl_mem[0] is about 196K, and on a 16-cpu box the variance is 512 bytes, which is OK with just percpu_counter_read I hope.

Maybe, on very large cpu counts, we should just change FBC_BATCH so that the variance does not go quadratic -- something like 32, so that the variance is 32 * NR_CPUS, instead of (NR_CPUS * NR_CPUS * 2) currently.

Comments?
-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
 Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:

 If the benchmarks say that we need to. If we cannot observe any problems in testing of existing code and if we can't demonstrate any benefit from the patched code then one option is to go off and do something else ;)

 We first tried plain per-CPU counters for memory_allocated, found that reads on memory_allocated were causing cacheline transfers, and then switched over to batching. So batching reads is useful. To avoid inaccuracy, we can maybe change percpu_counter_init to:

 void percpu_counter_init(struct percpu_counter *fbc, int maxdev)

 the percpu batching limit would then be maxdev/num_possible_cpus. One would use batching counters only when both reads and writes are frequent. With the above scheme, we would go fetch cachelines from other cpus for reads often only on large cpu counts, which is not any worse than the global counter alternative, but it would still be beneficial on smaller machines, without sacrificing a pre-set deviation. Comments?

 Sounds sane.

Here's an implementation which delegates tuning of batching to the user. We don't really need local_t at all, as percpu_counter_mod is not safe against interrupts and softirqs as it is. If we have a counter which could be modified in process context and irq/bh context, we just have to use a wrapper like percpu_counter_mod_bh, which will just disable and enable bottom halves. Reads on the counters are safe as they are atomic_reads, and the cpu-local variables are always accessed by that cpu only. (PS: the maxerr for ext2/ext3 is just a guesstimate.) Comments?
Index: linux-2.6.16-rc1mm4/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc1mm4.orig/include/linux/percpu_counter.h	2006-02-02 11:18:54.000000000 -0800
+++ linux-2.6.16-rc1mm4/include/linux/percpu_counter.h	2006-02-02 18:29:46.000000000 -0800
@@ -16,24 +16,32 @@
 struct percpu_counter {
 	atomic_long_t count;
+	int percpu_batch;
 	long *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH	(NR_CPUS*2)
-#else
-#define FBC_BATCH	(NR_CPUS*4)
-#endif
-
-static inline void percpu_counter_init(struct percpu_counter *fbc)
+/*
+ * Choose maxerr carefully. maxerr/num_possible_cpus indicates per-cpu batching
+ * Set maximum tolerance for better performance on large systems.
+ */
+static inline void percpu_counter_init(struct percpu_counter *fbc,
+					unsigned int maxerr)
 {
 	atomic_long_set(&fbc->count, 0);
-	fbc->counters = alloc_percpu(long);
+	fbc->percpu_batch = maxerr/num_possible_cpus();
+	if (fbc->percpu_batch) {
+		fbc->counters = alloc_percpu(long);
+		if (!fbc->counters)
+			fbc->percpu_batch = 0;
+	}
+
 }
 
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
-	free_percpu(fbc->counters);
+	if (fbc->percpu_batch)
+		free_percpu(fbc->counters);
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, long amount);
@@ -63,7 +71,8 @@ struct percpu_counter {
 	long count;
 };
 
-static inline void percpu_counter_init(struct percpu_counter *fbc)
+static inline void percpu_counter_init(struct percpu_counter *fbc,
+					unsigned int maxerr)
 {
 	fbc->count = 0;
 }

Index: linux-2.6.16-rc1mm4/mm/swap.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/swap.c	2006-01-29 20:20:20.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/swap.c	2006-02-02 18:36:21.000000000 -0800
@@ -470,13 +470,20 @@ static int cpu_swap_callback(struct noti
 #ifdef CONFIG_SMP
 void percpu_counter_mod(struct percpu_counter *fbc, long amount)
 {
-	long count;
 	long *pcount;
-	int cpu = get_cpu();
+	long count;
+	int cpu;
 
+	/* Slow mode */
+	if (unlikely(!fbc->percpu_batch)) {
+		atomic_long_add(amount, &fbc->count);
+		return;
+	}
+
+	cpu = get_cpu();
 	pcount = per_cpu_ptr(fbc->counters, cpu);
 	count = *pcount + amount;
-	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+	if (count >= fbc->percpu_batch || count <= -fbc->percpu_batch) {
 		atomic_long_add(count, &fbc->count);
 		count = 0;
 	}

Index: linux-2.6.16-rc1mm4/fs/ext2/super.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/fs/ext2/super.c	2006-02-02 18:30:28.000000000 -0800
+++ linux-2.6.16-rc1mm4/fs/ext2/super.c	2006-02-02 18:36:39.000000000 -0800
@@ -610,6 +610,7 @@ static int ext2_fill_super(struct super_
 	int db_count;
 	int i, j;
 	__le32 features;
+	int maxerr;
 
 	sbi = kmalloc(sizeof(*sbi), GFP_KERNEL
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
 Ravikiran G Thirumalai wrote:
 Change the atomic_t sockets_allocated member of struct proto to a per-cpu counter.
 Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
 Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
 Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

 Hi Ravikiran

 If I correctly read this patch, I think there is a scalability problem. On a big SMP machine, read_sockets_allocated() is going to be a real killer. Say we have 128 Opteron CPUs in a box.

read_sockets_allocated is invoked when /proc/net/protocols is read, which can be assumed to be infrequent. At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under certain conditions, under memory pressure -- on a large CPU count machine, you'd have large memory, and I don't think read_sockets_allocated would get called often. It did not, at least on our 8-cpu/16G box. So this should be OK I think. There're no 128-CPU Opteron boxes yet AFAIK ;).

Thanks,
Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
 Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:
 which can be assumed to be infrequent. At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under certain conditions, under memory pressure -- on a large CPU count machine, you'd have large memory, and I don't think read_sockets_allocated would get called often. It did not, at least on our 8-cpu/16G box. So this should be OK I think.

 That being said, the percpu_counters aren't a terribly successful concept and probably do need a revisit due to the high inaccuracy at high CPU counts. It might be better to do some generic version of vm_acct_memory() instead.

AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

 If the benchmarks say that we need to. If we cannot observe any problems in testing of existing code and if we can't demonstrate any benefit from the patched code then one option is to go off and do something else ;)

We first tried plain per-CPU counters for memory_allocated, found that reads on memory_allocated were causing cacheline transfers, and then switched over to batching. So batching reads is useful. To avoid inaccuracy, we can maybe change percpu_counter_init to:

void percpu_counter_init(struct percpu_counter *fbc, int maxdev)

the percpu batching limit would then be maxdev/num_possible_cpus. One would use batching counters only when both reads and writes are frequent. With the above scheme, we would go fetch cachelines from other cpus for reads often only on large cpu counts, which is not any worse than the global counter alternative, but it would still be beneficial on smaller machines, without sacrificing a pre-set deviation. Comments?

Thanks,
Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
 There are several issues here:
 alloc_percpu()'s current implementation is a waste of RAM (because it uses slab allocations that have a minimum size of 32 bytes).

Oh, there was a solution for that :).
http://lwn.net/Articles/119532/
I can quickly revive it if there is interest.

Thanks,
Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
 Andrew Morton [EMAIL PROTECTED] wrote:
 Oh, and because vm_acct_memory() is counting a singleton object, it can use DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc overhead.

 Actually, I don't think that's true. We're allocating a sizeof(long) with kmalloc_node(), so there shouldn't be memory wastage.

Oh yes there is. Each dynamic per-cpu object would have been at least (NR_CPUS * sizeof(void *) + num_cpus_possible * cacheline_size). Now kmalloc_node will fall back on size-32 for the allocation of a long, so replace the cacheline_size above with 32 -- which then means dynamic per-cpu data are not on a cacheline boundary anymore (most modern cpus have 64-byte/128-byte cache lines), which means per-cpu data could end up falsely shared.

Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sat, Jan 28, 2006 at 12:21:07AM +0100, Eric Dumazet wrote:
 Ravikiran G Thirumalai wrote:
 On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
 Why not use a boot-time allocated percpu area (as done today in setup_per_cpu_areas()), but instead of reserving extra space for modules' percpu data, be able to serve alloc_percpu() from this reserved area (ie, no kmalloced data anymore), and keeping your

At that time ia64 placed a limit on the max size of the per-CPU area (PERCPU_ENOUGH_ROOM). I think that the limit is still there, but hopefully 64K per-CPU should be enough for static + dynamic + modules? Let me do an allyesconfig on my box and verify.

Thanks,
Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sat, Jan 28, 2006 at 01:35:03AM +0100, Eric Dumazet wrote:
 Eric Dumazet wrote:
 Andrew Morton wrote:
 Eric Dumazet [EMAIL PROTECTED] wrote:

 #ifdef CONFIG_SMP
 void percpu_counter_mod(struct percpu_counter *fbc, long amount)
 {
	long old, new;
	atomic_long_t *pcount;

	pcount = per_cpu_ptr(fbc->counters, get_cpu());
 start:
	old = atomic_long_read(pcount);
	new = old + amount;
	if (new >= FBC_BATCH || new <= -FBC_BATCH) {
		if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
			goto start;
		atomic_long_add(new, &fbc->count);
	} else
		atomic_long_add(amount, pcount);
	put_cpu();
 }
 EXPORT_SYMBOL(percpu_counter_mod);

 long percpu_counter_read_accurate(struct percpu_counter *fbc)
 {
	long res = 0;
	int cpu;
	atomic_long_t *pcount;

	for_each_cpu(cpu) {
		pcount = per_cpu_ptr(fbc->counters, cpu);
		/* dont dirty cache line if not necessary */
		if (atomic_long_read(pcount))
			res += atomic_long_xchg(pcount, 0);	--- (A)
	}
	atomic_long_add(res, &fbc->count);			--- (B)
	res = atomic_long_read(&fbc->count);
	return res;
 }

The read is still theoretically FBC_BATCH * NR_CPUS inaccurate, no? What happens when other cpus update their local counters at (A) and (B)? (I am hoping we don't need percpu_counter_read_accurate anywhere yet and this is just a demo ;). I certainly don't want to go on all cpus for reads, or add a cmpxchg on the write path, for the proto counters that started this discussion.)

Thanks,
Kiran
[patch 0/4] net: Percpufy frequently used variables on struct proto
The following patches change struct proto.memory_allocated and proto.sockets_allocated to use per-cpu counters. This patchset also switches the proto.inuse percpu variable to use alloc_percpu, instead of NR_CPUS * cacheline-size padding. We saw a 5% improvement in apache bench requests per second with this patchset on a multi-NIC, 8-way, 3.3 GHz IBM x460 Xeon server.

Patches follow.

Thanks,
Kiran
[patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Change the atomic_t sockets_allocated member of struct proto to a per-cpu counter.

Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16-rc1/include/net/sock.h
===================================================================
--- linux-2.6.16-rc1.orig/include/net/sock.h	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/include/net/sock.h	2006-01-24 17:35:34.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
 	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
-	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+	int			*sockets_allocated;	/* Current number of sockets (percpu counter). */
 
 	/*
 	 * Pressure flag: try to collapse.
@@ -579,6 +579,24 @@ struct proto {
 	} stats[NR_CPUS];
 };
 
+static inline int read_sockets_allocated(struct proto *prot)
+{
+	int total = 0;
+	int cpu;
+	for_each_cpu(cpu)
+		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+	return total;
+}
+
+static inline void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+	put_cpu();
+}
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
 extern int proto_register(struct proto *prot, int alloc_slab);
 extern void proto_unregister(struct proto *prot);

Index: linux-2.6.16-rc1/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc1.orig/include/net/tcp.h	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/include/net/tcp.h	2006-01-24 17:05:34.000000000 -0800
@@ -221,7 +221,7 @@ extern int sysctl_tcp_tso_win_divisor;
 extern int sysctl_tcp_abc;
 
 extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
 /*

Index: linux-2.6.16-rc1/net/core/sock.c
===================================================================
--- linux-2.6.16-rc1.orig/net/core/sock.c	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/core/sock.c	2006-01-24 16:47:55.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock
 		newsk->sk_sleep = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
-			atomic_inc(newsk->sk_prot->sockets_allocated);
+			inc_sockets_allocated(newsk->sk_prot->sockets_allocated);
 	}
 out:
 	return newsk;
@@ -1620,7 +1620,7 @@ static void proto_seq_printf(struct seq_
 		   "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
-		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
+		   proto->sockets_allocated != NULL ? read_sockets_allocated(proto) : -1,
 		   proto->memory_allocated != NULL ? percpu_counter_read(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",

Index: linux-2.6.16-rc1/net/core/stream.c
===================================================================
--- linux-2.6.16-rc1.orig/net/core/stream.c	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/core/stream.c	2006-01-24 16:20:22.000000000 -0800
@@ -243,7 +243,7 @@ int sk_stream_mem_schedule(struct sock *
 		return 1;
 
 	if (!*sk->sk_prot->memory_pressure ||
-	    sk->sk_prot->sysctl_mem[2] > atomic_read(sk->sk_prot->sockets_allocated) *
+	    sk->sk_prot->sysctl_mem[2] > read_sockets_allocated(sk->sk_prot) *
 	    sk_stream_pages(sk->sk_wmem_queued +
 			    atomic_read(&sk->sk_rmem_alloc) +
 			    sk->sk_forward_alloc))

Index: linux-2.6.16-rc1/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/proc.c	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/proc.c	2006-01-24 16:20:22.000000000 -0800
@@ -63,7 +63,7 @@ static int sockstat_seq_show(struct seq_
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
 		   fold_prot_inuse(tcp_prot), atomic_read(&tcp_orphan_count),
-		   tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
+		   tcp_death_row.tw_count, read_sockets_allocated(tcp_prot),
 		   percpu_counter_read(&tcp_memory_allocated));
[patch 2/4] net: Percpufy frequently used variables -- struct proto.memory_allocated
Change struct proto->memory_allocated to a batching per-CPU counter (percpu_counter) from an atomic_t. A batching counter is better than a plain per-CPU counter as this field is read often.

Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16-rc1mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc1mm3.orig/include/net/sock.h	2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/include/net/sock.h	2006-01-25 16:56:59.000000000 -0800
@@ -48,6 +48,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/percpu_counter.h>
 
 #include <linux/filter.h>
@@ -541,8 +542,9 @@ struct proto {
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
-	atomic_t		*memory_allocated;	/* Current allocated memory. */
+	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
 	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+
 	/*
 	 * Pressure flag: try to collapse.
 	 * Technical note: it is used by multiple contexts non atomically.

Index: linux-2.6.16-rc1mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc1mm3.orig/include/net/tcp.h	2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/include/net/tcp.h	2006-01-25 16:56:59.000000000 -0800
@@ -220,7 +220,7 @@ extern int sysctl_tcp_moderate_rcvbuf;
 extern int sysctl_tcp_tso_win_divisor;
 extern int sysctl_tcp_abc;
 
-extern atomic_t tcp_memory_allocated;
+extern struct percpu_counter tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
 extern int tcp_memory_pressure;

Index: linux-2.6.16-rc1mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc1mm3.orig/net/core/sock.c	2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/net/core/sock.c	2006-01-25 16:56:59.000000000 -0800
@@ -1616,12 +1616,13 @@ static char proto_method_implemented(con
 
 static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
-	seq_printf(seq, "%-9s %4u %6d %6d %-3s %6u %-3s %-10s "
+	seq_printf(seq, "%-9s %4u %6d %6lu %-3s %6u %-3s %-10s "
 		   "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
-		   proto->memory_allocated != NULL ? atomic_read(proto->memory_allocated) : -1,
+		   proto->memory_allocated != NULL ?
+			percpu_counter_read(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
 		   proto->max_header,
 		   proto->slab == NULL ? "no" : "yes",

Index: linux-2.6.16-rc1mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc1mm3.orig/net/core/stream.c	2006-01-25 16:25:16.000000000 -0800
+++ linux-2.6.16-rc1mm3/net/core/stream.c	2006-01-25 16:56:59.000000000 -0800
@@ -17,6 +17,7 @@
 #include <linux/signal.h>
 #include <linux/tcp.h>
 #include <linux/wait.h>
+#include <linux/percpu.h>
 
 #include <net/sock.h>
 
 /**
@@ -196,11 +197,11 @@ EXPORT_SYMBOL(sk_stream_error);
 void __sk_stream_mem_reclaim(struct sock *sk)
 {
 	if (sk->sk_forward_alloc >= SK_STREAM_MEM_QUANTUM) {
-		atomic_sub(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM,
-			   sk->sk_prot->memory_allocated);
+		percpu_counter_mod_bh(sk->sk_prot->memory_allocated,
+				-(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM));
 		sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;
 		if (*sk->sk_prot->memory_pressure &&
-		    (atomic_read(sk->sk_prot->memory_allocated) <
+		    (percpu_counter_read(sk->sk_prot->memory_allocated) <
 		     sk->sk_prot->sysctl_mem[0]))
 			*sk->sk_prot->memory_pressure = 0;
 	}
@@ -213,23 +214,26 @@ int sk_stream_mem_schedule(struct sock *
 	int amt = sk_stream_pages(size);
 
 	sk->sk_forward_alloc += amt * SK_STREAM_MEM_QUANTUM;
-	atomic_add(amt, sk->sk_prot->memory_allocated);
+	percpu_counter_mod_bh(sk->sk_prot->memory_allocated, amt);
 
 	/* Under limit. */
-	if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
+	if (percpu_counter_read(sk->sk_prot->memory_allocated) <
+	    sk->sk_prot->sysctl_mem[0]) {
 		if (*sk->sk_prot->memory_pressure)
 			*sk->sk_prot->memory_pressure = 0;
[patch 1/4] net: Percpufy frequently used variables -- add percpu_counter_mod_bh
Add percpu_counter_mod_bh for using these counters safely from both softirq and process context.

Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off-by: Ravikiran G Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16-rc1/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc1.orig/include/linux/percpu_counter.h	2006-01-25 11:53:56.000000000 -0800
+++ linux-2.6.16-rc1/include/linux/percpu_counter.h	2006-01-25 12:09:41.000000000 -0800
@@ -11,6 +11,7 @@
 #include <linux/smp.h>
 #include <linux/threads.h>
 #include <linux/percpu.h>
+#include <linux/interrupt.h>
 
 #ifdef CONFIG_SMP
 
@@ -102,4 +103,11 @@ static inline void percpu_counter_dec(st
 	percpu_counter_mod(fbc, -1);
 }
 
+static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
+{
+	local_bh_disable();
+	percpu_counter_mod(fbc, amount);
+	local_bh_enable();
+}
+
 #endif /* _LINUX_PERCPU_COUNTER_H */