Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> On Thu, Feb 02, 2006 at 07:16:00PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > >
> > > On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> > > Here's an implementation which delegates tuning of batching to the
> > > user.  We don't really need local_t at all as percpu_counter_mod is
> > > not safe against interrupts and softirqs as it is.  If we have a
> > > counter which could be modified in process context and irq/bh
> > > context, we just have to use a wrapper like percpu_counter_mod_bh
> > > which will just disable and enable bottom halves.  Reads on the
> > > counters are safe as they are atomic_reads, and the cpu local
> > > variables are always accessed by that cpu only.
> > >
> > > (PS: the maxerr for ext2/ext3 is just a guesstimate)
> >
> > Well that's the problem.  We need to choose production-quality values
> > for use in there.
>
> The guesstimate was loosely based on keeping the per-cpu batch at at
> least 8 on reasonably sized systems, while not letting maxerr grow too
> big.  I guess machines with cpu counts of more than 128 don't use ext3
> :).  And if they do, they can tune the counters with a higher maxerr.  I
> guess it might be a bit ugly on the user side with all the if
> num_possible_cpus(), but is there a better alternative?
>
> (I plan to test the counter values for ext2 and ext3 on a 16-way box,
> and change these if they turn out to be not so good)

OK, thanks.

Frankly, I think I went overboard on the scalability thing when adding
percpu counters to ext2 and ext3.  I suspect they're not providing
significant benefit over a per-sb spinlock and a ulong.

> > > Comments?
> >
> > Using num_possible_cpus() in that header file is just asking for build
> > errors.  Probably best to uninline the function rather than adding the
> > needed include of cpumask.h.
>
> Yup,
>
> Here it is.
> Change the percpu_counter interface so that the user can specify the
> maximum tolerable deviation.

OK, thanks.  I need to sit down and a) remember why we're even discussing
this and b) see what we've merged thus far and work out what it all does ;)

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Thu, Feb 02, 2006 at 07:16:00PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> > Here's an implementation which delegates tuning of batching to the
> > user.  We don't really need local_t at all as percpu_counter_mod is not
> > safe against interrupts and softirqs as it is.  If we have a counter
> > which could be modified in process context and irq/bh context, we just
> > have to use a wrapper like percpu_counter_mod_bh which will just
> > disable and enable bottom halves.  Reads on the counters are safe as
> > they are atomic_reads, and the cpu local variables are always accessed
> > by that cpu only.
> >
> > (PS: the maxerr for ext2/ext3 is just a guesstimate)
>
> Well that's the problem.  We need to choose production-quality values for
> use in there.

The guesstimate was loosely based on keeping the per-cpu batch at at least
8 on reasonably sized systems, while not letting maxerr grow too big.  I
guess machines with cpu counts of more than 128 don't use ext3 :).  And if
they do, they can tune the counters with a higher maxerr.  I guess it might
be a bit ugly on the user side with all the if num_possible_cpus(), but is
there a better alternative?

(I plan to test the counter values for ext2 and ext3 on a 16-way box, and
change these if they turn out to be not so good)

> > Comments?
>
> Using num_possible_cpus() in that header file is just asking for build
> errors.  Probably best to uninline the function rather than adding the
> needed include of cpumask.h.

Yup,

Here it is.

Change the percpu_counter interface so that the user can specify the
maximum tolerable deviation.
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.16-rc1mm4/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc1mm4.orig/include/linux/percpu_counter.h	2006-02-02 11:18:54.0 -0800
+++ linux-2.6.16-rc1mm4/include/linux/percpu_counter.h	2006-02-03 11:04:05.0 -0800
@@ -16,24 +16,20 @@
 
 struct percpu_counter {
 	atomic_long_t count;
+	int percpu_batch;
 	long *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH	(NR_CPUS*2)
-#else
-#define FBC_BATCH	(NR_CPUS*4)
-#endif
-
-static inline void percpu_counter_init(struct percpu_counter *fbc)
-{
-	atomic_long_set(&fbc->count, 0);
-	fbc->counters = alloc_percpu(long);
-}
+/*
+ * Choose maxerr carefully.  maxerr/num_possible_cpus indicates per-cpu
+ * batching.  Set maximum tolerance for better performance on large systems.
+ */
+void percpu_counter_init(struct percpu_counter *fbc, unsigned int maxerr);
 
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
-	free_percpu(fbc->counters);
+	if (fbc->percpu_batch)
+		free_percpu(fbc->counters);
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, long amount);
@@ -63,7 +59,8 @@
 struct percpu_counter {
 	long count;
 };
 
-static inline void percpu_counter_init(struct percpu_counter *fbc)
+static inline void percpu_counter_init(struct percpu_counter *fbc,
+					unsigned int maxerr)
 {
 	fbc->count = 0;
 }
Index: linux-2.6.16-rc1mm4/mm/swap.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/swap.c	2006-01-29 20:20:20.0 -0800
+++ linux-2.6.16-rc1mm4/mm/swap.c	2006-02-03 11:02:05.0 -0800
@@ -468,15 +468,39 @@ static int cpu_swap_callback(struct noti
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_SMP
+
+/*
+ * Choose maxerr carefully.  maxerr/num_possible_cpus indicates per-cpu
+ * batching.  Set maximum tolerance for better performance on large systems.
+ */
+void percpu_counter_init(struct percpu_counter *fbc, unsigned int maxerr)
+{
+	atomic_long_set(&fbc->count, 0);
+	fbc->percpu_batch = maxerr/num_possible_cpus();
+	if (fbc->percpu_batch) {
+		fbc->counters = alloc_percpu(long);
+		if (!fbc->counters)
+			fbc->percpu_batch = 0;
+	}
+
+}
+
 void percpu_counter_mod(struct percpu_counter *fbc, long amount)
 {
-	long count;
 	long *pcount;
-	int cpu = get_cpu();
+	long count;
+	int cpu;
 
+	/* Slow mode */
+	if (unlikely(!fbc->percpu_batch)) {
+		atomic_long_add(amount, &fbc->count);
+		return;
+	}
+
+	cpu = get_cpu();
 	pcount = per_cpu_ptr(fbc->counters, cpu);
 	count = *pcount + amount;
-	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+	if (count >= fbc->percpu_batch || count <= -fbc->percpu_batch) {
 		atomic_long_add(count, &fbc->count);
 		count = 0;
 	}
@@ -484,6 +508,7 @@ void percpu_counter_mod(struct percpu_co
 	put_cpu();
 }
 EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(percpu_counter_init)
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > >
> > > > If the benchmarks say that we need to.  If we cannot observe any
> > > > problems in testing of existing code and if we can't demonstrate any
> > > > benefit from the patched code then one option is to go off and do
> > > > something else ;)
> > >
> > > We first tried plain per-CPU counters for memory_allocated, found
> > > that reads on memory_allocated were causing cacheline transfers, and
> > > then switched over to batching.  So batching reads is useful.  To
> > > avoid inaccuracy, we can maybe change percpu_counter_init to:
> > >
> > > void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
> > >
> > > the percpu batching limit would then be maxdev/num_possible_cpus.
> > > One would use batching counters only when both reads and writes are
> > > frequent.  With the above scheme, we would go fetch cachelines from
> > > other cpus for read often only on large cpu counts, which is not any
> > > worse than the global counter alternative, but it would still be
> > > beneficial on smaller machines, without sacrificing a pre-set
> > > deviation.
> > >
> > > Comments?
> >
> > Sounds sane.
>
> Here's an implementation which delegates tuning of batching to the user.
> We don't really need local_t at all as percpu_counter_mod is not safe
> against interrupts and softirqs as it is.  If we have a counter which
> could be modified in process context and irq/bh context, we just have to
> use a wrapper like percpu_counter_mod_bh which will just disable and
> enable bottom halves.  Reads on the counters are safe as they are
> atomic_reads, and the cpu local variables are always accessed by that cpu
> only.
>
> (PS: the maxerr for ext2/ext3 is just a guesstimate)

Well that's the problem.  We need to choose production-quality values for
use in there.

> Comments?
Using num_possible_cpus() in that header file is just asking for build
errors.  Probably best to uninline the function rather than adding the
needed include of cpumask.h.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > > If the benchmarks say that we need to.  If we cannot observe any
> > > problems in testing of existing code and if we can't demonstrate any
> > > benefit from the patched code then one option is to go off and do
> > > something else ;)
> >
> > We first tried plain per-CPU counters for memory_allocated, found that
> > reads on memory_allocated were causing cacheline transfers, and then
> > switched over to batching.  So batching reads is useful.  To avoid
> > inaccuracy, we can maybe change percpu_counter_init to:
> >
> > void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
> >
> > the percpu batching limit would then be maxdev/num_possible_cpus.  One
> > would use batching counters only when both reads and writes are
> > frequent.  With the above scheme, we would go fetch cachelines from
> > other cpus for read often only on large cpu counts, which is not any
> > worse than the global counter alternative, but it would still be
> > beneficial on smaller machines, without sacrificing a pre-set
> > deviation.
> >
> > Comments?
>
> Sounds sane.

Here's an implementation which delegates tuning of batching to the user.
We don't really need local_t at all as percpu_counter_mod is not safe
against interrupts and softirqs as it is.  If we have a counter which could
be modified in process context and irq/bh context, we just have to use a
wrapper like percpu_counter_mod_bh which will just disable and enable
bottom halves.  Reads on the counters are safe as they are atomic_reads,
and the cpu local variables are always accessed by that cpu only.

(PS: the maxerr for ext2/ext3 is just a guesstimate)

Comments?
Index: linux-2.6.16-rc1mm4/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc1mm4.orig/include/linux/percpu_counter.h	2006-02-02 11:18:54.0 -0800
+++ linux-2.6.16-rc1mm4/include/linux/percpu_counter.h	2006-02-02 18:29:46.0 -0800
@@ -16,24 +16,32 @@
 
 struct percpu_counter {
 	atomic_long_t count;
+	int percpu_batch;
 	long *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH	(NR_CPUS*2)
-#else
-#define FBC_BATCH	(NR_CPUS*4)
-#endif
-
-static inline void percpu_counter_init(struct percpu_counter *fbc)
+/*
+ * Choose maxerr carefully.  maxerr/num_possible_cpus indicates per-cpu batching
+ * Set maximum tolerance for better performance on large systems.
+ */
+static inline void percpu_counter_init(struct percpu_counter *fbc,
+					unsigned int maxerr)
 {
 	atomic_long_set(&fbc->count, 0);
-	fbc->counters = alloc_percpu(long);
+	fbc->percpu_batch = maxerr/num_possible_cpus();
+	if (fbc->percpu_batch) {
+		fbc->counters = alloc_percpu(long);
+		if (!fbc->counters)
+			fbc->percpu_batch = 0;
+	}
+
 }
 
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
-	free_percpu(fbc->counters);
+	if (fbc->percpu_batch)
+		free_percpu(fbc->counters);
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, long amount);
@@ -63,7 +71,8 @@
 struct percpu_counter {
 	long count;
 };
 
-static inline void percpu_counter_init(struct percpu_counter *fbc)
+static inline void percpu_counter_init(struct percpu_counter *fbc,
+					unsigned int maxerr)
 {
 	fbc->count = 0;
 }
Index: linux-2.6.16-rc1mm4/mm/swap.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/swap.c	2006-01-29 20:20:20.0 -0800
+++ linux-2.6.16-rc1mm4/mm/swap.c	2006-02-02 18:36:21.0 -0800
@@ -470,13 +470,20 @@ static int cpu_swap_callback(struct noti
 #ifdef CONFIG_SMP
 void percpu_counter_mod(struct percpu_counter *fbc, long amount)
 {
-	long count;
 	long *pcount;
-	int cpu = get_cpu();
+	long count;
+	int cpu;
 
+	/* Slow mode */
+	if (unlikely(!fbc->percpu_batch)) {
+		atomic_long_add(amount, &fbc->count);
+		return;
+	}
+
+	cpu = get_cpu();
 	pcount = per_cpu_ptr(fbc->counters, cpu);
 	count = *pcount + amount;
-	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+	if (count >= fbc->percpu_batch || count <= -fbc->percpu_batch) {
 		atomic_long_add(count, &fbc->count);
 		count = 0;
 	}
Index: linux-2.6.16-rc1mm4/fs/ext2/super.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/fs/ext2/super.c	2006-02-02 18:30:28.0 -0800
+++ linux-2.6.16-rc1mm4/fs/ext2/super.c	2006-02-02 18:36:39.0 -0800
@@ -610,6 +610,7 @@ static int ext2_fill_super(struct super_
 	int db_count;
 	int i, j;
 	__le32 features;
+	i
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sun, Jan 29, 2006 at 07:54:09AM +0100, Eric Dumazet wrote:
> Well, I think that might be doable, maybe RCU magic ?
>
> 1) local_t are not that nice on all archs.

It is for the users that matter, and the hooks are there if someone finds
it to be a performance problem.

> 2) The consolidation phase (summing all the cpus local offset to
> consolidate the central counter) might be more difficult to do (we would
> need kind of 2 counters per cpu, and an index that can be changed by the
> cpu that wants a consolidation (still 'expensive'))

For the vast majority of these sorts of statistics counters, we don't need
100% accurate counts.  And I think it should be possible to choose between
a lightweight implementation and the expensive implementation.  On a chip
like the Core Duo the cost of bouncing between the two cores is minimal, so
all the extra code and data is a waste.

> 3) Are the locked ops so expensive if done on a cache line that is mostly
> in exclusive state in cpu cache ?

Yes.  What happens on the P4 is that it forces outstanding memory
transactions in the reorder buffer to be flushed so that the memory barrier
semantics of the lock prefix are observed.  This can take a long time as
there can be over a hundred instructions in flight.

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party."  Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Benjamin LaHaise wrote:
> On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> > We might use atomic_long_t only (and no spinlocks)
> > Something like this ?
>
> Erk, complex and slow...  Try using local_t instead, which is
> substantially cheaper on the P4 as it doesn't use the lock prefix and
> acts as a memory barrier.  See asm/local.h.

Well, I think that might be doable, maybe RCU magic ?

1) local_t are not that nice on all archs.

2) The consolidation phase (summing all the cpus local offset to
consolidate the central counter) might be more difficult to do (we would
need kind of 2 counters per cpu, and an index that can be changed by the
cpu that wants a consolidation (still 'expensive'))

struct cpu_offset {
	local_t offset[2];
};

struct percpu_counter {
	atomic_long_t count;
	unsigned int offidx;
	spinlock_t lock;	/* to guard offidx changes */
	struct cpu_offset *counters;
};

void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
	long val;
	struct cpu_offset *cp;
	local_t *l;

	cp = per_cpu_ptr(fbc->counters, get_cpu());
	l = &cp->offset[fbc->offidx];
	local_add(amount, l);
	val = local_read(l);
	if (val >= FBC_BATCH || val <= -FBC_BATCH) {
		local_set(l, 0);
		atomic_long_add(val, &fbc->count);
	}
	put_cpu();
}

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
	long res = 0, val;
	int cpu;
	unsigned int idx;
	struct cpu_offset *cp;
	local_t *l;

	spin_lock(&fbc->lock);
	idx = fbc->offidx;
	fbc->offidx ^= 1;
	mb();
	/*
	 * FIXME :
	 * must 'wait' other cpus dont touch anymore their old local_t
	 */
	for_each_cpu(cpu) {
		cp = per_cpu_ptr(fbc->counters, cpu);
		l = &cp->offset[idx];
		val = local_read(l);
		/* dont dirty alien cache line if not necessary */
		if (val)
			local_set(l, 0);
		res += val;
	}
	spin_unlock(&fbc->lock);
	atomic_long_add(res, &fbc->count);
	return atomic_long_read(&fbc->count);
}

3) Are the locked ops so expensive if done on a cache line that is mostly
in exclusive state in cpu cache ?

Thank you

Eric
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
[adding linux-arch]

On Sunday 29 January 2006 01:55, Andrew Morton wrote:
> Benjamin LaHaise <[EMAIL PROTECTED]> wrote:
> > On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> > > We might use atomic_long_t only (and no spinlocks)
> > > Something like this ?
> >
> > Erk, complex and slow...  Try using local_t instead, which is
> > substantially cheaper on the P4 as it doesn't use the lock prefix and
> > acts as a memory barrier.  See asm/local.h.
>
> local_t isn't much use until we get rid of asm-generic/local.h.  Bloaty,
> racy with nested interrupts.

It is just implemented wrong.  It should use local_irq_save()/
local_irq_restore() instead.

But my bigger problem with local_t is these few architectures (IA64,
PPC64) which implement it with atomic_t.  This means we can't replace
local statistics counters with local_t because it would be a regression
for them.  I haven't done the benchmarks yet, but I suspect both IA64 and
PPC64 really should just turn off interrupts.

-Andi
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sat, Jan 28, 2006 at 08:19:44PM -0500, Benjamin LaHaise wrote:
> The overuse of atomics is horrific in what is being proposed.  All the
> major architectures except powerpc (i386, x86-64, ia64, and sparc64)
> implement local_t.  It would make far more sense to push the last few
> stragglers (which mostly seem to be uniprocessor) into writing the
> appropriate implementations.  Perhaps it's time to add a #error in
> asm-generic/local.h?

Surely asm-generic/local.h could now be reimplemented using atomic_long_t
to kill that aberration that is the BITS_PER_LONG != 32 case currently...?
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Benjamin LaHaise <[EMAIL PROTECTED]> wrote:
>
> On Sat, Jan 28, 2006 at 04:55:49PM -0800, Andrew Morton wrote:
> > local_t isn't much use until we get rid of asm-generic/local.h.
> > Bloaty, racy with nested interrupts.
>
> The overuse of atomics is horrific in what is being proposed.

Well yeah, it wasn't really a very serious proposal.  There's some
significant work yet to be done on this stuff.

> Perhaps it's time to add a #error in asm-generic/local.h?

heh, or #warning "you suck" or something.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sat, Jan 28, 2006 at 04:55:49PM -0800, Andrew Morton wrote:
> local_t isn't much use until we get rid of asm-generic/local.h.  Bloaty,
> racy with nested interrupts.

The overuse of atomics is horrific in what is being proposed.  All the
major architectures except powerpc (i386, x86-64, ia64, and sparc64)
implement local_t.  It would make far more sense to push the last few
stragglers (which mostly seem to be uniprocessor) into writing the
appropriate implementations.  Perhaps it's time to add a #error in
asm-generic/local.h?

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party."  Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Benjamin LaHaise <[EMAIL PROTECTED]> wrote:
>
> On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> > We might use atomic_long_t only (and no spinlocks)
> > Something like this ?
>
> Erk, complex and slow...  Try using local_t instead, which is
> substantially cheaper on the P4 as it doesn't use the lock prefix and
> acts as a memory barrier.  See asm/local.h.

local_t isn't much use until we get rid of asm-generic/local.h.  Bloaty,
racy with nested interrupts.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> We might use atomic_long_t only (and no spinlocks)
> Something like this ?

Erk, complex and slow...  Try using local_t instead, which is substantially
cheaper on the P4 as it doesn't use the lock prefix and acts as a memory
barrier.  See asm/local.h.

		-ben
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Ravikiran G Thirumalai wrote:
> On Sat, Jan 28, 2006 at 01:35:03AM +0100, Eric Dumazet wrote:
> > Eric Dumazet wrote:
> > > Andrew Morton wrote:
> > > > Eric Dumazet <[EMAIL PROTECTED]> wrote:
> >
> > long percpu_counter_read_accurate(struct percpu_counter *fbc)
> > {
> > 	long res = 0;
> > 	int cpu;
> > 	atomic_long_t *pcount;
> >
> > 	for_each_cpu(cpu) {
> > 		pcount = per_cpu_ptr(fbc->counters, cpu);
> > 		/* dont dirty cache line if not necessary */
> > 		if (atomic_long_read(pcount))
> > 			res += atomic_long_xchg(pcount, 0);	---> (A)
> > 	}
> > 	atomic_long_add(res, &fbc->count);			---> (B)
> > 	res = atomic_long_read(&fbc->count);
> > 	return res;
> > }
>
> The read is still theoretically FBC_BATCH * NR_CPUS inaccurate, no?
> What happens when other cpus update their local counters at (A) and (B)?

Well, after atomic_read(&some_atomic) or percpu_counter_read_accurate(&s),
the 'value' you got may be inaccurate by whatever...  You cannot 'freeze
the atomic / percpu_counter and get its accurate value'.  All you can do is
'atomically add/remove some value from this counter (atomic or
percpu_counter)'.

See percpu_counter_read_accurate() as an attempt to collect all local
offsets (and atomically set them to 0) and atomically readjust the central
fbc->count with the sum of all these offsets.  Kind of a consolidation that
might be driven by a user request (read /proc/sys/...) or a periodic timer.

> (I am hoping we don't need percpu_counter_read_accurate anywhere yet and
> this is just demo ;).  I certainly don't want to go on all cpus for read
> / add cmpxchg on the write path for the proto counters that started this
> discussion)

As pointed out by Andrew (if you decode carefully his short sentences :) ),
all you really need is atomic_long_xchg() (only in the slow path), and
atomic_long_{read/add} in the fast path.

Eric

struct percpu_counter {
	atomic_long_t count;
	atomic_long_t *counters;
};

#ifdef CONFIG_SMP
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
	long new;
	atomic_long_t *pcount;

	pcount = per_cpu_ptr(fbc->counters, get_cpu());
	new = atomic_long_read(pcount) + amount;
	if (new >= FBC_BATCH || new <= -FBC_BATCH) {
		new = atomic_long_xchg(pcount, 0) + amount;
		if (new)
			atomic_long_add(new, &fbc->count);
	} else
		atomic_long_add(amount, pcount);

	put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
	long res = 0;
	int cpu;
	atomic_long_t *pcount;

	for_each_cpu(cpu) {
		pcount = per_cpu_ptr(fbc->counters, cpu);
		/* dont dirty cache line if not necessary */
		if (atomic_long_read(pcount))
			res += atomic_long_xchg(pcount, 0);
	}
	atomic_long_add(res, &fbc->count);
	return atomic_long_read(&fbc->count);
}
EXPORT_SYMBOL(percpu_counter_read_accurate);
#endif /* CONFIG_SMP */
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sat, Jan 28, 2006 at 01:35:03AM +0100, Eric Dumazet wrote:
> Eric Dumazet wrote:
> > Andrew Morton wrote:
> > > Eric Dumazet <[EMAIL PROTECTED]> wrote:
>
> #ifdef CONFIG_SMP
> void percpu_counter_mod(struct percpu_counter *fbc, long amount)
> {
> 	long old, new;
> 	atomic_long_t *pcount;
>
> 	pcount = per_cpu_ptr(fbc->counters, get_cpu());
> start:
> 	old = atomic_long_read(pcount);
> 	new = old + amount;
> 	if (new >= FBC_BATCH || new <= -FBC_BATCH) {
> 		if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
> 			goto start;
> 		atomic_long_add(new, &fbc->count);
> 	} else
> 		atomic_long_add(amount, pcount);
>
> 	put_cpu();
> }
> EXPORT_SYMBOL(percpu_counter_mod);
>
> long percpu_counter_read_accurate(struct percpu_counter *fbc)
> {
> 	long res = 0;
> 	int cpu;
> 	atomic_long_t *pcount;
>
> 	for_each_cpu(cpu) {
> 		pcount = per_cpu_ptr(fbc->counters, cpu);
> 		/* dont dirty cache line if not necessary */
> 		if (atomic_long_read(pcount))
> 			res += atomic_long_xchg(pcount, 0);	---> (A)
> 	}
> 	atomic_long_add(res, &fbc->count);			---> (B)
> 	res = atomic_long_read(&fbc->count);
> 	return res;
> }

The read is still theoretically FBC_BATCH * NR_CPUS inaccurate, no?
What happens when other cpus update their local counters at (A) and (B)?

(I am hoping we don't need percpu_counter_read_accurate anywhere yet and
this is just demo ;).  I certainly don't want to go on all cpus for read /
add cmpxchg on the write path for the proto counters that started this
discussion)

Thanks,
Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Eric Dumazet <[EMAIL PROTECTED]> wrote:
>
> [PATCH] Add atomic_long_xchg() and atomic_long_cmpxchg() wrappers
>
> ...
>
> +static inline long atomic_long_xchg(atomic_long_t *l, long val)
> +{
> +	atomic64_t *v = (atomic64_t *)l;
> +	return atomic64_xchg(v, val);

All we need now is some implementations of atomic64_xchg() ;)
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Andrew Morton wrote:
> Eric Dumazet <[EMAIL PROTECTED]> wrote:
> > > An advantage of retaining a spinlock in percpu_counter is that if
> > > accuracy is needed at a low rate (say, /proc reading) we can take the
> > > lock and then go spill each CPU's local count into the main one.  It
> > > would need to be a very low rate though.  Or we make the cpu-local
> > > counters atomic too.
> >
> > We might use atomic_long_t only (and no spinlocks)
>
> Yup, that's it.
>
> > Something like this ?
>
> It'd be a lot neater if we had atomic_long_xchg().

You are my guest :)

[PATCH] Add atomic_long_xchg() and atomic_long_cmpxchg() wrappers

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

--- a/include/asm-generic/atomic.h	2006-01-28 02:59:49.0 +0100
+++ b/include/asm-generic/atomic.h	2006-01-28 02:57:36.0 +0100
@@ -66,6 +66,18 @@
 	atomic64_sub(i, v);
 }
 
+static inline long atomic_long_xchg(atomic_long_t *l, long val)
+{
+	atomic64_t *v = (atomic64_t *)l;
+	return atomic64_xchg(v, val);
+}
+
+static inline long atomic_long_cmpxchg(atomic_long_t *l, long old, long new)
+{
+	atomic64_t *v = (atomic64_t *)l;
+	return atomic64_cmpxchg(v, old, new);
+}
+
 #else
 
 typedef atomic_t atomic_long_t;
@@ -113,5 +125,17 @@
 	atomic_sub(i, v);
 }
 
+static inline long atomic_long_xchg(atomic_long_t *l, long val)
+{
+	atomic_t *v = (atomic_t *)l;
+	return atomic_xchg(v, val);
+}
+
+static inline long atomic_long_cmpxchg(atomic_long_t *l, long old, long new)
+{
+	atomic_t *v = (atomic_t *)l;
+	return atomic_cmpxchg(v, old, new);
+}
+
 #endif
 #endif
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Eric Dumazet <[EMAIL PROTECTED]> wrote:
>
> > An advantage of retaining a spinlock in percpu_counter is that if
> > accuracy is needed at a low rate (say, /proc reading) we can take the
> > lock and then go spill each CPU's local count into the main one.  It
> > would need to be a very low rate though.  Or we make the cpu-local
> > counters atomic too.
>
> We might use atomic_long_t only (and no spinlocks)

Yup, that's it.

> Something like this ?

It'd be a lot neater if we had atomic_long_xchg().
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sat, Jan 28, 2006 at 12:21:07AM +0100, Eric Dumazet wrote:
> Ravikiran G Thirumalai wrote:
> > On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
>
> Why not use a boot time allocated percpu area (as done today in
> setup_per_cpu_areas()), but instead of reserving extra space for module's
> percpu data, being able to serve alloc_percpu() from this reserved area
> (ie no kmalloced data anymore), and keeping your

At that time ia64 placed a limit on the max size of the per-CPU area
(PERCPU_ENOUGH_ROOM).  I think that the limit is still there, but hopefully
64K per-CPU should be enough for static + dynamic + modules?  Let me do an
allyesconfig on my box and verify.

Thanks,
Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Eric Dumazet a écrit : Andrew Morton a écrit : Eric Dumazet <[EMAIL PROTECTED]> wrote: Ravikiran G Thirumalai a écrit : On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote: Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote: which can be assumed as not frequent. At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only certain conditions, under memory pressure -- on a large CPU count machine, you'd have large memory, and I don't think read_sockets_allocated would get called often. It did not atleast on our 8cpu/16G box. So this should be OK I think. That being said, the percpu_counters aren't a terribly successful concept and probably do need a revisit due to the high inaccuracy at high CPU counts. It might be better to do some generic version of vm_acct_memory() instead. AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ... Ah... yes you are right, I read min(16, NR_CPUS*2) So did I ;) I wonder if it is not a typo... I mean, I understand the more cpus you have, the less updates on central atomic_t is desirable, but a quadratic offset seems too much... I'm not sure whether it was a mistake or if I intended it and didn't do the sums on accuracy :( An advantage of retaining a spinlock in percpu_counter is that if accuracy is needed at a low rate (say, /proc reading) we can take the lock and then go spill each CPU's local count into the main one. It would need to be a very low rate though. Or we make the cpu-local counters atomic too. We might use atomic_long_t only (and no spinlocks) Something like this ? 
struct percpu_counter {
	atomic_long_t count;
	atomic_long_t *counters;
};

#ifdef CONFIG_SMP

void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
	long old, new;
	atomic_long_t *pcount;

	pcount = per_cpu_ptr(fbc->counters, get_cpu());
start:
	old = atomic_long_read(pcount);
	new = old + amount;
	if (new >= FBC_BATCH || new <= -FBC_BATCH) {
		if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
			goto start;
		atomic_long_add(new, &fbc->count);
	} else
		atomic_long_add(amount, pcount);
	put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
	long res = 0;
	int cpu;
	atomic_long_t *pcount;

	for_each_cpu(cpu) {
		pcount = per_cpu_ptr(fbc->counters, cpu);
		/* dont dirty cache line if not necessary */
		if (atomic_long_read(pcount))
			res += atomic_long_xchg(pcount, 0);
	}
	atomic_long_add(res, &fbc->count);
	res = atomic_long_read(&fbc->count);
	return res;
}
EXPORT_SYMBOL(percpu_counter_read_accurate);

#endif /* CONFIG_SMP */
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> > Andrew Morton <[EMAIL PROTECTED]> wrote:
> > >
> > > Oh, and because vm_acct_memory() is counting a singleton object, it can use
> > > DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> > > overhead.
> >
> > Actually, I don't think that's true. We're allocating a sizeof(long) with
> > kmalloc_node() so there shouldn't be memory wastage.
>
> Oh yeah there is. Each dynamic per-cpu object would have been at least
> (NR_CPUS * sizeof (void *) + num_cpus_possible * cacheline_size).
> Now kmalloc_node will fall back on size-32 for allocation of a long, so
> replace the cacheline_size above with 32 -- which then means dynamic per-cpu
> data are not on a cacheline boundary anymore (most modern cpus have
> 64-byte/128-byte cache lines), which means per-cpu data could end up
> false-shared.

OK. But isn't the core of the problem the fact that __alloc_percpu() is using kmalloc_node() rather than a (new, as-yet-unimplemented) kmalloc_cpu()? kmalloc_cpu() wouldn't need the L1 cache alignment.

It might be worth creating just a small number of per-cpu slabs (4-byte, 8-byte). A kmalloc_cpu() would just need a per-cpu array of kmem_cache_t*'s and it'd internally use kmalloc_node(cpu_to_node), no?

Or we could just give __alloc_percpu() a custom, hand-rolled, not-cacheline-padded sizeof(long) slab per CPU and use that if (size == sizeof(long)). Or something.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Andrew Morton a écrit :

Eric Dumazet <[EMAIL PROTECTED]> wrote:

Ravikiran G Thirumalai a écrit :

On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:

Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

which can be assumed as not frequent. At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under certain conditions, under memory pressure -- on a large CPU count machine, you'd have large memory, and I don't think read_sockets_allocated would get called often. It did not at least on our 8cpu/16G box. So this should be OK I think.

That being said, the percpu_counters aren't a terribly successful concept and probably do need a revisit due to the high inaccuracy at high CPU counts. It might be better to do some generic version of vm_acct_memory() instead.

AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

Ah... yes you are right, I read min(16, NR_CPUS*2)

So did I ;)

I wonder if it is not a typo... I mean, I understand the more cpus you have, the less updates on the central atomic_t are desirable, but a quadratic offset seems too much...

I'm not sure whether it was a mistake or if I intended it and didn't do the sums on accuracy :( An advantage of retaining a spinlock in percpu_counter is that if accuracy is needed at a low rate (say, /proc reading) we can take the lock and then go spill each CPU's local count into the main one. It would need to be a very low rate though. Or we make the cpu-local counters atomic too.

We might use atomic_long_t only (and no spinlocks). Something like this?
struct percpu_counter {
	atomic_long_t count;
	atomic_long_t *counters;
};

#ifdef CONFIG_SMP

void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
	long old, new;
	atomic_long_t *pcount;

	pcount = per_cpu_ptr(fbc->counters, get_cpu());
start:
	old = atomic_long_read(pcount);
	new = old + amount;
	if (new >= FBC_BATCH || new <= -FBC_BATCH) {
		if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
			goto start;
		atomic_long_add(new, &fbc->count);
	} else
		atomic_long_add(amount, pcount);
	put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
	long res = 0;
	int cpu;
	atomic_long_t *pcount;

	for_each_cpu(cpu) {
		pcount = per_cpu_ptr(fbc->counters, cpu);
		/* dont dirty cache line if not necessary */
		if (atomic_long_read(pcount))
			res += atomic_long_xchg(pcount, 0);
	}
	return res;
}
EXPORT_SYMBOL(percpu_counter_read_accurate);

#endif /* CONFIG_SMP */
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> Andrew Morton <[EMAIL PROTECTED]> wrote:
> >
> > Oh, and because vm_acct_memory() is counting a singleton object, it can use
> > DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> > overhead.
>
> Actually, I don't think that's true. We're allocating a sizeof(long) with
> kmalloc_node() so there shouldn't be memory wastage.

Oh yeah there is. Each dynamic per-cpu object would have been at least (NR_CPUS * sizeof (void *) + num_cpus_possible * cacheline_size). Now kmalloc_node will fall back on size-32 for allocation of a long, so replace the cacheline_size above with 32 -- which then means dynamic per-cpu data are not on a cacheline boundary anymore (most modern cpus have 64-byte/128-byte cache lines), which means per-cpu data could end up false-shared.

Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Ravikiran G Thirumalai a écrit :

On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:

There are several issues here: alloc_percpu()'s current implementation is a waste of RAM (because it uses slab allocations that have a minimum size of 32 bytes).

Oh there was a solution for that :). http://lwn.net/Articles/119532/ I can quickly revive it if there is interest.

Well, nice work ! :) Maybe a little bit complicated if the expected percpu space is 50 KB per cpu? Why not use a boot time allocated percpu area (as done today in setup_per_cpu_areas()), but instead of reserving extra space for module's percpu data, be able to serve alloc_percpu() from this reserved area (ie no kmalloced data anymore), keeping your

#define per_cpu_ptr(ptr, cpu) ((__typeof__(ptr)) \
	(RELOC_HIDE(ptr, PCPU_BLKSIZE * cpu)))

Some code from kernel/module.c could be reworked to serve both as an allocator when a module's percpu data must be relocated (insmod)/freed (rmmod), and to serve alloc_percpu() for 'dynamic allocations'.

Eric
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Eric Dumazet <[EMAIL PROTECTED]> wrote:
> Ravikiran G Thirumalai a écrit :
> > On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> >> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >>> which can be assumed as not frequent.
> >>> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under
> >>> certain conditions, under memory pressure -- on a large CPU count machine,
> >>> you'd have large memory, and I don't think read_sockets_allocated would get
> >>> called often. It did not at least on our 8cpu/16G box. So this should be OK
> >>> I think.
> >> That being said, the percpu_counters aren't a terribly successful concept
> >> and probably do need a revisit due to the high inaccuracy at high CPU
> >> counts. It might be better to do some generic version of vm_acct_memory()
> >> instead.
> >
> > AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
> > same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
>
> Ah... yes you are right, I read min(16, NR_CPUS*2)

So did I ;)

> I wonder if it is not a typo... I mean, I understand the more cpus you have,
> the less updates on the central atomic_t are desirable, but a quadratic offset
> seems too much...

I'm not sure whether it was a mistake or if I intended it and didn't do the sums on accuracy :(

An advantage of retaining a spinlock in percpu_counter is that if accuracy is needed at a low rate (say, /proc reading) we can take the lock and then go spill each CPU's local count into the main one. It would need to be a very low rate though. Or we make the cpu-local counters atomic too.

Certainly it's sensible to delegate the tuning to the creator of the percpu_counter, but it'll be a difficult thing to get right.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > >
> > > which can be assumed as not frequent.
> > > At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under
> > > certain conditions, under memory pressure -- on a large CPU count machine,
> > > you'd have large memory, and I don't think read_sockets_allocated would get
> > > called often. It did not at least on our 8cpu/16G box. So this should be OK
> > > I think.
> >
> > That being said, the percpu_counters aren't a terribly successful concept
> > and probably do need a revisit due to the high inaccuracy at high CPU
> > counts. It might be better to do some generic version of vm_acct_memory()
> > instead.
>
> AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the
> same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

I suppose so. Except vm_acct_memory() has

	#define ACCT_THRESHOLD	max(16, NR_CPUS * 2)

But if we were to perform similar tuning to percpu_counter, yes, they're pretty similar.

Oh, and because vm_acct_memory() is counting a singleton object, it can use DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc overhead.

> > If the benchmarks say that we need to. If we cannot observe any problems
> > in testing of existing code and if we can't demonstrate any benefit from
> > the patched code then one option is to go off and do something else ;)
>
> We first tried plain per-CPU counters for memory_allocated, found that reads
> on memory_allocated were causing cacheline transfers, and then switched over
> to batching. So batching reads is useful. To avoid inaccuracy, we can maybe
> change percpu_counter_init to:
>
>	void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
>
> the percpu batching limit would then be maxdev/num_possible_cpus. One would
> use batching counters only when both reads and writes are frequent. With
> the above scheme, we would go fetch cachelines from other cpus for reads
> often only on large cpu counts, which is not any worse than the global
> counter alternative, but it would still be beneficial on smaller machines,
> without sacrificing a pre-set deviation.
>
> Comments?

Sounds sane.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> Oh, and because vm_acct_memory() is counting a singleton object, it can use
> DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> overhead.

Actually, I don't think that's true. We're allocating a sizeof(long) with kmalloc_node() so there shouldn't be memory wastage.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Ravikiran G Thirumalai a écrit :

On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:

Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

which can be assumed as not frequent. At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under certain conditions, under memory pressure -- on a large CPU count machine, you'd have large memory, and I don't think read_sockets_allocated would get called often. It did not at least on our 8cpu/16G box. So this should be OK I think.

That being said, the percpu_counters aren't a terribly successful concept and probably do need a revisit due to the high inaccuracy at high CPU counts. It might be better to do some generic version of vm_acct_memory() instead.

AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

Ah... yes you are right, I read min(16, NR_CPUS*2)

I wonder if it is not a typo... I mean, I understand the more cpus you have, the less updates on the central atomic_t are desirable, but a quadratic offset seems too much...

Eric
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
>
> There are several issues here :
>
> alloc_percpu()'s current implementation is a waste of RAM (because it uses
> slab allocations that have a minimum size of 32 bytes)

Oh there was a solution for that :). http://lwn.net/Articles/119532/

I can quickly revive it if there is interest.

Thanks,
Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > which can be assumed as not frequent.
> > At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under
> > certain conditions, under memory pressure -- on a large CPU count machine,
> > you'd have large memory, and I don't think read_sockets_allocated would get
> > called often. It did not at least on our 8cpu/16G box. So this should be OK
> > I think.
>
> That being said, the percpu_counters aren't a terribly successful concept
> and probably do need a revisit due to the high inaccuracy at high CPU
> counts. It might be better to do some generic version of vm_acct_memory()
> instead.

AFAICS vm_acct_memory is no better. The deviation on large cpu counts is the same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

> If the benchmarks say that we need to. If we cannot observe any problems
> in testing of existing code and if we can't demonstrate any benefit from
> the patched code then one option is to go off and do something else ;)

We first tried plain per-CPU counters for memory_allocated, found that reads on memory_allocated were causing cacheline transfers, and then switched over to batching. So batching reads is useful. To avoid inaccuracy, we can maybe change percpu_counter_init to:

	void percpu_counter_init(struct percpu_counter *fbc, int maxdev)

the percpu batching limit would then be maxdev/num_possible_cpus. One would use batching counters only when both reads and writes are frequent. With the above scheme, we would go fetch cachelines from other cpus for reads often only on large cpu counts, which is not any worse than the global counter alternative, but it would still be beneficial on smaller machines, without sacrificing a pre-set deviation.

Comments?
Thanks,
Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Andrew Morton a écrit :

Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:

Ravikiran G Thirumalai a écrit :

Change the atomic_t sockets_allocated member of struct proto to a per-cpu counter.

Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Hi Ravikiran

If I correctly read this patch, I think there is a scalability problem. On a big SMP machine, read_sockets_allocated() is going to be a real killer. Say we have 128 Opteron CPUs in a box.

read_sockets_allocated is being invoked when /proc/net/protocols is read, which can be assumed as not frequent. At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under certain conditions, under memory pressure -- on a large CPU count machine, you'd have large memory, and I don't think read_sockets_allocated would get called often. It did not at least on our 8cpu/16G box. So this should be OK I think.

That being said, the percpu_counters aren't a terribly successful concept and probably do need a revisit due to the high inaccuracy at high CPU counts. It might be better to do some generic version of vm_acct_memory() instead.

There are several issues here:

alloc_percpu()'s current implementation is a waste of RAM (because it uses slab allocations that have a minimum size of 32 bytes).

Currently we cannot use per_cpu(&some_object, cpu), so a generic version of vm_acct_memory() would need a rework of percpu.h, and maybe this is not possible on every platform?

#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))

-->

#define per_cpu_name(var) per_cpu__##var
#define per_cpu_addr(var) &per_cpu_name(var)
#define per_cpu(var, cpu) (*RELOC_HIDE(per_cpu_addr(var), __per_cpu_offset[cpu]))

But this could render TLS migration difficult...
Eric
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
> > Ravikiran G Thirumalai a écrit :
> > > Change the atomic_t sockets_allocated member of struct proto to a
> > > per-cpu counter.
> > >
> > > Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
> > > Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
> > > Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
> >
> > Hi Ravikiran
> >
> > If I correctly read this patch, I think there is a scalability problem.
> >
> > On a big SMP machine, read_sockets_allocated() is going to be a real killer.
> >
> > Say we have 128 Opteron CPUs in a box.
>
> read_sockets_allocated is being invoked when /proc/net/protocols is read,
> which can be assumed as not frequent.
> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under
> certain conditions, under memory pressure -- on a large CPU count machine,
> you'd have large memory, and I don't think read_sockets_allocated would get
> called often. It did not at least on our 8cpu/16G box. So this should be OK
> I think.

That being said, the percpu_counters aren't a terribly successful concept and probably do need a revisit due to the high inaccuracy at high CPU counts. It might be better to do some generic version of vm_acct_memory() instead.

If the benchmarks say that we need to. If we cannot observe any problems in testing of existing code and if we can't demonstrate any benefit from the patched code then one option is to go off and do something else ;)
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
> Ravikiran G Thirumalai a écrit :
> > Change the atomic_t sockets_allocated member of struct proto to a
> > per-cpu counter.
> >
> > Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
> > Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
> > Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
>
> Hi Ravikiran
>
> If I correctly read this patch, I think there is a scalability problem.
>
> On a big SMP machine, read_sockets_allocated() is going to be a real killer.
>
> Say we have 128 Opteron CPUs in a box.

read_sockets_allocated is being invoked when /proc/net/protocols is read, which can be assumed as not frequent. At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under certain conditions, under memory pressure -- on a large CPU count machine, you'd have large memory, and I don't think read_sockets_allocated would get called often. It did not at least on our 8cpu/16G box. So this should be OK I think.

There are no 128-CPU Opteron boxes yet afaik ;).

Thanks,
Kiran
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Ravikiran G Thirumalai a écrit :

Change the atomic_t sockets_allocated member of struct proto to a per-cpu counter.

Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Hi Ravikiran

If I correctly read this patch, I think there is a scalability problem.

On a big SMP machine, read_sockets_allocated() is going to be a real killer. Say we have 128 Opteron CPUs in a box. You'll need to bring in 128 cache lines (plus 8*128 bytes to read the 128 pointers inside the percpu structure).

I think a solution à la percpu_counter is preferable, or even better a dedicated per_cpu with threshold management (see mm/swap.c, function vm_acct_memory(), for how vm_committed_space is updated without too bad SMP scalability).

Thank you

Eric
[patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Change the atomic_t sockets_allocated member of struct proto to a per-cpu counter.

Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.16-rc1/include/net/sock.h
===================================================================
--- linux-2.6.16-rc1.orig/include/net/sock.h	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/include/net/sock.h	2006-01-24 17:35:34.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(void);
 	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
-	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+	int			*sockets_allocated;	/* Current number of sockets (percpu counter). */
 	/*
 	 * Pressure flag: try to collapse.
@@ -579,6 +579,24 @@ struct proto {
 	} stats[NR_CPUS];
 };
+static inline int read_sockets_allocated(struct proto *prot)
+{
+	int total = 0;
+	int cpu;
+	for_each_cpu(cpu)
+		total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+	return total;
+}
+
+static inline void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+	(*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+	put_cpu();
+}
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
 extern int proto_register(struct proto *prot, int alloc_slab);
 extern void proto_unregister(struct proto *prot);

Index: linux-2.6.16-rc1/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc1.orig/include/net/tcp.h	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/include/net/tcp.h	2006-01-24 17:05:34.000000000 -0800
@@ -221,7 +221,7 @@ extern int sysctl_tcp_tso_win_divisor;
 extern int sysctl_tcp_abc;
 extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 /*

Index: linux-2.6.16-rc1/net/core/sock.c
===================================================================
--- linux-2.6.16-rc1.orig/net/core/sock.c	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/core/sock.c	2006-01-24 16:47:55.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock
 		newsk->sk_sleep	 = NULL;
 		if (newsk->sk_prot->sockets_allocated)
-			atomic_inc(newsk->sk_prot->sockets_allocated);
+			inc_sockets_allocated(newsk->sk_prot->sockets_allocated);
 	}
 out:
 	return newsk;
@@ -1620,7 +1620,7 @@ static void proto_seq_printf(struct seq_
 		   "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
-		   proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
+		   proto->sockets_allocated != NULL ? read_sockets_allocated(proto) : -1,
 		   proto->memory_allocated != NULL ? percpu_counter_read(proto->memory_allocated) : -1,
 		   proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",

Index: linux-2.6.16-rc1/net/core/stream.c
===================================================================
--- linux-2.6.16-rc1.orig/net/core/stream.c	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/core/stream.c	2006-01-24 16:20:22.000000000 -0800
@@ -243,7 +243,7 @@ int sk_stream_mem_schedule(struct sock *
 		return 1;
 	if (!*sk->sk_prot->memory_pressure ||
-	    sk->sk_prot->sysctl_mem[2] > atomic_read(sk->sk_prot->sockets_allocated) *
+	    sk->sk_prot->sysctl_mem[2] > read_sockets_allocated(sk->sk_prot) *
 	    sk_stream_pages(sk->sk_wmem_queued +
 			    atomic_read(&sk->sk_rmem_alloc) +
 			    sk->sk_forward_alloc))

Index: linux-2.6.16-rc1/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/proc.c	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/proc.c	2006-01-24 16:20:22.000000000 -0800
@@ -63,7 +63,7 @@ static int sockstat_seq_show(struct seq_
 	socket_seq_show(seq);
 	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
 		   fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
-		   tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
+		   tcp_death_row.tw_count, read_sockets_allocated(&tcp_prot),
		   percpu_co