Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-02-03 Thread Andrew Morton
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> On Thu, Feb 02, 2006 at 07:16:00PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > >
> > > On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> > > Here's an implementation which delegates tuning of batching to the user.  
> > > We
> > > don't really need local_t at all as percpu_counter_mod is not safe against
> > > interrupts and softirqs  as it is.  If we have a counter which could be
> > > modified in process context and irq/bh context, we just have to use a
> > > wrapper like percpu_counter_mod_bh which will just disable and enable 
> > > bottom
> > > halves.  Reads on the counters are safe as they are atomic_reads, and the
> > > cpu local variables are always accessed by that cpu only.
> > > 
> > > (PS: the maxerr for ext2/ext3 is just a guesstimate)
> > 
> > Well that's the problem.  We need to choose production-quality values for
> > use in there.
> 
> The guesstimate was loosely based on keeping the per-cpu batch at least 8
> on reasonably sized systems, while not letting maxerr grow too big.  I guess
> machines with cpu counts of more than 128 don't use ext3 :).  And if they do,
> they can tune the counters with a higher maxerr.  I guess it might be a bit
> ugly on the user side with all the if num_possible_cpus() checks, but is
> there a better alternative?
> 
> (I plan to test the counter values for ext2 and ext3 on a 16 way box, and
> change these if they turn out to be not so good)

OK, thanks.  Frankly I think I went overboard on the scalability thing when
adding percpu counters to ext2 and ext3.  I suspect they're not providing
significant benefit over per-sb-spinlock and a ulong.

> > 
> > > Comments?
> > 
> > Using num_possible_cpus() in that header file is just asking for build
> > errors.  Probably best to uninline the function rather than adding the
> > needed include of cpumask.h.
> 
> Yup,
> 
> Here it is.
> 
> Change the percpu_counter interface so that user can specify the maximum
> tolerable deviation.

OK, thanks.  I need to sit down and a) remember why we're even discussing
this and b) see what we've merged thus far and work out what it all does ;)



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-02-03 Thread Ravikiran G Thirumalai
On Thu, Feb 02, 2006 at 07:16:00PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> > Here's an implementation which delegates tuning of batching to the user.  We
> > don't really need local_t at all as percpu_counter_mod is not safe against
> > interrupts and softirqs  as it is.  If we have a counter which could be
> > modified in process context and irq/bh context, we just have to use a
> > wrapper like percpu_counter_mod_bh which will just disable and enable bottom
> > halves.  Reads on the counters are safe as they are atomic_reads, and the
> > cpu local variables are always accessed by that cpu only.
> > 
> > (PS: the maxerr for ext2/ext3 is just a guesstimate)
> 
> Well that's the problem.  We need to choose production-quality values for
> use in there.

The guesstimate was loosely based on keeping the per-cpu batch at least 8
on reasonably sized systems, while not letting maxerr grow too big.  I guess
machines with cpu counts of more than 128 don't use ext3 :).  And if they do,
they can tune the counters with a higher maxerr.  I guess it might be a bit
ugly on the user side with all the if num_possible_cpus() checks, but is
there a better alternative?

(I plan to test the counter values for ext2 and ext3 on a 16 way box, and
change these if they turn out to be not so good)
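
To make the tuning arithmetic concrete, here is a sketch of the kind of call
site this implies; the maxerr value and the ext3 field name are illustrative
assumptions, not taken from the patch:

	/*
	 * Hypothetical tuning example: with maxerr = 1024 the global count
	 * never strays more than 1024 from the truth, while the per-cpu
	 * batch is 1024/num_possible_cpus() -- 64 on a 16-way box and
	 * still 8 on a 128-way box.
	 */
	percpu_counter_init(&sbi->s_freeblocks_counter, 1024);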

> 
> > Comments?
> 
> Using num_possible_cpus() in that header file is just asking for build
> errors.  Probably best to uninline the function rather than adding the
> needed include of cpumask.h.

Yup,

Here it is.

Change the percpu_counter interface so that user can specify the maximum
tolerable deviation.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>

Index: linux-2.6.16-rc1mm4/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc1mm4.orig/include/linux/percpu_counter.h	2006-02-02 11:18:54.000000000 -0800
+++ linux-2.6.16-rc1mm4/include/linux/percpu_counter.h	2006-02-03 11:04:05.000000000 -0800
@@ -16,24 +16,20 @@
 
 struct percpu_counter {
atomic_long_t count;
+   int percpu_batch;
long *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH  (NR_CPUS*2)
-#else
-#define FBC_BATCH  (NR_CPUS*4)
-#endif
-
-static inline void percpu_counter_init(struct percpu_counter *fbc)
-{
-   atomic_long_set(&fbc->count, 0);
-   fbc->counters = alloc_percpu(long);
-}
+/* 
+ * Choose maxerr carefully: maxerr/num_possible_cpus becomes the per-cpu
+ * batch size, so a larger tolerance gives better performance on large systems.
+ */
+void percpu_counter_init(struct percpu_counter *fbc, unsigned int maxerr);
 
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
-   free_percpu(fbc->counters);
+   if (fbc->percpu_batch)
+   free_percpu(fbc->counters);
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, long amount);
@@ -63,7 +59,8 @@ struct percpu_counter {
long count;
 };
 
-static inline void percpu_counter_init(struct percpu_counter *fbc)
+static inline void percpu_counter_init(struct percpu_counter *fbc, 
+   unsigned int maxerr)
 {
fbc->count = 0;
 }
Index: linux-2.6.16-rc1mm4/mm/swap.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/swap.c	2006-01-29 20:20:20.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/swap.c	2006-02-03 11:02:05.000000000 -0800
@@ -468,15 +468,39 @@ static int cpu_swap_callback(struct noti
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_SMP
+
+/* 
+ * Choose maxerr carefully: maxerr/num_possible_cpus becomes the per-cpu
+ * batch size, so a larger tolerance gives better performance on large systems.
+ */
+void percpu_counter_init(struct percpu_counter *fbc, unsigned int maxerr)
+{
+   atomic_long_set(&fbc->count, 0);
+   fbc->percpu_batch = maxerr/num_possible_cpus();
+   if (fbc->percpu_batch) {
+   fbc->counters = alloc_percpu(long);
+   if (!fbc->counters)
+   fbc->percpu_batch = 0;
+   }
+   
+}
+
 void percpu_counter_mod(struct percpu_counter *fbc, long amount)
 {
-   long count;
long *pcount;
-   int cpu = get_cpu();
+   long count;
+   int cpu;
 
+   /* Slow mode */
+   if (unlikely(!fbc->percpu_batch)) {
+   atomic_long_add(amount, &fbc->count);
+   return;
+   }
+   
+   cpu = get_cpu();
pcount = per_cpu_ptr(fbc->counters, cpu);
count = *pcount + amount;
-   if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+   if (count >= fbc->percpu_batch || count <= -fbc->percpu_batch) {
atomic_long_add(count, &fbc->count);
count = 0;
}
@@ -484,6 +508,7 @@ void percpu_counter_mod(struct percpu_co
put_cpu();
 }
 EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(percpu_counter_init);

Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-02-02 Thread Andrew Morton
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > 
> > 
> > > > 
> > > > If the benchmarks say that we need to.  If we cannot observe any 
> > > > problems
> > > > in testing of existing code and if we can't demonstrate any benefit from
> > > > the patched code then one option is to go off and do something else ;)
> > > 
> > > We first tried plain per-CPU counters for memory_allocated, found that
> > > reads on memory_allocated were causing cacheline transfers, and then
> > > switched over to batching.  So batching reads is useful.  To avoid
> > > inaccuracy, we can maybe change percpu_counter_init to:
> > > 
> > > void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
> > > 
> > > the percpu batching limit would then be maxdev/num_possible_cpus.  One
> > > would use batching counters only when both reads and writes are frequent.
> > > With the above scheme, we would go fetch cachelines from other cpus for
> > > reads only on large cpu counts, which is not any worse than the global
> > > counter alternative, but it would still be beneficial on smaller machines,
> > > without sacrificing a pre-set deviation.
> > > 
> > > Comments?
> > 
> > Sounds sane.
> >
> 
> Here's an implementation which delegates tuning of batching to the user.  We
> don't really need local_t at all as percpu_counter_mod is not safe against
> interrupts and softirqs  as it is.  If we have a counter which could be
> modified in process context and irq/bh context, we just have to use a
> wrapper like percpu_counter_mod_bh which will just disable and enable bottom
> halves.  Reads on the counters are safe as they are atomic_reads, and the
> cpu local variables are always accessed by that cpu only.
> 
> (PS: the maxerr for ext2/ext3 is just a guesstimate)

Well that's the problem.  We need to choose production-quality values for
use in there.

> Comments?

Using num_possible_cpus() in that header file is just asking for build
errors.  Probably best to uninline the function rather than adding the
needed include of cpumask.h.



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-02-02 Thread Ravikiran G Thirumalai
On Fri, Jan 27, 2006 at 03:01:06PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> 
> 
> > > 
> > > If the benchmarks say that we need to.  If we cannot observe any problems
> > > in testing of existing code and if we can't demonstrate any benefit from
> > > the patched code then one option is to go off and do something else ;)
> > 
> > We first tried plain per-CPU counters for memory_allocated, found that reads
> > on memory_allocated were causing cacheline transfers, and then
> > switched over to batching.  So batching reads is useful.  To avoid
> > inaccuracy, we can maybe change percpu_counter_init to:
> > 
> > void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
> > 
> > the percpu batching limit would then be maxdev/num_possible_cpus.  One would
> > use batching counters only when both reads and writes are frequent.  With
> > the above scheme, we would go fetch cachelines from other cpus for reads
> > only on large cpu counts, which is not any worse than the global
> > counter alternative, but it would still be beneficial on smaller machines,
> > without sacrificing a pre-set deviation.
> > 
> > Comments?
> 
> Sounds sane.
>

Here's an implementation which delegates tuning of batching to the user.  We
don't really need local_t at all as percpu_counter_mod is not safe against
interrupts and softirqs  as it is.  If we have a counter which could be
modified in process context and irq/bh context, we just have to use a
wrapper like percpu_counter_mod_bh which will just disable and enable bottom
halves.  Reads on the counters are safe as they are atomic_reads, and the
cpu local variables are always accessed by that cpu only.
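
A minimal sketch of such a wrapper (the name percpu_counter_mod_bh is
hypothetical; it is not part of the patch below):

	/*
	 * Serialize process-context updates against softirq-context updates
	 * of the same counter by closing bottom halves around the otherwise
	 * unprotected fast path.
	 */
	static inline void percpu_counter_mod_bh(struct percpu_counter *fbc,
						 long amount)
	{
		local_bh_disable();
		percpu_counter_mod(fbc, amount);
		local_bh_enable();
	}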

(PS: the maxerr for ext2/ext3 is just a guesstimate)

Comments?

Index: linux-2.6.16-rc1mm4/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc1mm4.orig/include/linux/percpu_counter.h	2006-02-02 11:18:54.000000000 -0800
+++ linux-2.6.16-rc1mm4/include/linux/percpu_counter.h	2006-02-02 18:29:46.000000000 -0800
@@ -16,24 +16,32 @@
 
 struct percpu_counter {
atomic_long_t count;
+   int percpu_batch;
long *counters;
 };
 
-#if NR_CPUS >= 16
-#define FBC_BATCH  (NR_CPUS*2)
-#else
-#define FBC_BATCH  (NR_CPUS*4)
-#endif
 
-static inline void percpu_counter_init(struct percpu_counter *fbc)
+/*
+ * Choose maxerr carefully: maxerr/num_possible_cpus becomes the per-cpu
+ * batch size, so a larger tolerance gives better performance on large systems.
+ */
+static inline void percpu_counter_init(struct percpu_counter *fbc, 
+   unsigned int maxerr)
 {
atomic_long_set(&fbc->count, 0);
-   fbc->counters = alloc_percpu(long);
+   fbc->percpu_batch = maxerr/num_possible_cpus();
+   if (fbc->percpu_batch) {
+   fbc->counters = alloc_percpu(long);
+   if (!fbc->counters)
+   fbc->percpu_batch = 0;
+   }
+   
 }
 
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
-   free_percpu(fbc->counters);
+   if (fbc->percpu_batch)
+   free_percpu(fbc->counters);
 }
 
 void percpu_counter_mod(struct percpu_counter *fbc, long amount);
@@ -63,7 +71,8 @@ struct percpu_counter {
long count;
 };
 
-static inline void percpu_counter_init(struct percpu_counter *fbc)
+static inline void percpu_counter_init(struct percpu_counter *fbc, 
+   unsigned int maxerr)
 {
fbc->count = 0;
 }
Index: linux-2.6.16-rc1mm4/mm/swap.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/swap.c	2006-01-29 20:20:20.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/swap.c	2006-02-02 18:36:21.000000000 -0800
@@ -470,13 +470,20 @@ static int cpu_swap_callback(struct noti
 #ifdef CONFIG_SMP
 void percpu_counter_mod(struct percpu_counter *fbc, long amount)
 {
-   long count;
long *pcount;
-   int cpu = get_cpu();
+   long count;
+   int cpu;
 
+   /* Slow mode */
+   if (unlikely(!fbc->percpu_batch)) {
+   atomic_long_add(amount, &fbc->count);
+   return;
+   }
+   
+   cpu = get_cpu();
pcount = per_cpu_ptr(fbc->counters, cpu);
count = *pcount + amount;
-   if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+   if (count >= fbc->percpu_batch || count <= -fbc->percpu_batch) {
atomic_long_add(count, &fbc->count);
count = 0;
}
Index: linux-2.6.16-rc1mm4/fs/ext2/super.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/fs/ext2/super.c	2006-02-02 18:30:28.000000000 -0800
+++ linux-2.6.16-rc1mm4/fs/ext2/super.c	2006-02-02 18:36:39.000000000 -0800
@@ -610,6 +610,7 @@ static int ext2_fill_super(struct super_
int db_count;
int i, j;
__le32 features;
+   i

Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-29 Thread Benjamin LaHaise
On Sun, Jan 29, 2006 at 07:54:09AM +0100, Eric Dumazet wrote:
> Well, I think that might be doable, maybe RCU magic ?
> 
> 1) local_t are not that nice on all archs.

It is for the users that matter, and the hooks are there if someone finds 
it to be a performance problem.

> 2) The consolidation phase (summing all the cpus' local offsets to 
> consolidate the central counter) might be more difficult to do (we would 
> need kind of 2 counters per cpu, and an index that can be changed by the cpu 
> that wants a consolidation (still 'expensive'))

For the vast majority of these sorts of statistics counters, we don't 
need 100% accurate counts.  And I think it should be possible to choose 
between a lightweight implementation and the expensive implementation.  
On a chip like the Core Duo the cost of bouncing between the two cores 
is minimal, so all the extra code and data is a waste.

> 3) Are the locked ops so expensive if done on a cache line that is mostly 
> in exclusive state in cpu cache ?

Yes.  What happens on the P4 is that it forces outstanding memory 
transactions in the reorder buffer to be flushed so that the memory barrier 
semantics of the lock prefix are observed.  This can take a long time as
there can be over a hundred instructions in flight.

-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <[EMAIL PROTECTED]>.


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-28 Thread Eric Dumazet

Benjamin LaHaise wrote:

> On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> > We might use atomic_long_t only (and no spinlocks)
> > Something like this ?
> 
> Erk, complex and slow...  Try using local_t instead, which is substantially 
> cheaper on the P4 as it doesn't use the lock prefix and act as a memory 
> barrier.  See asm/local.h.




Well, I think that might be doable, maybe RCU magic ?

1) local_t are not that nice on all archs.

2) The consolidation phase (summing all the cpus' local offsets to consolidate 
the central counter) might be more difficult to do (we would need kind of 2 
counters per cpu, and an index that can be changed by the cpu that wants a 
consolidation (still 'expensive'))


struct cpu_offset {
local_t   offset[2];
};

struct percpu_counter {
atomic_long_t count;
unsigned int offidx;
spinlock_t   lock; /* to guard offidx changes */
struct cpu_offset *counters;
};

void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
long val;
struct cpu_offset *cp;
local_t *l;

cp = per_cpu_ptr(fbc->counters, get_cpu());
l = &cp->offset[fbc->offidx];

local_add(amount, l);
val = local_read(l);
if (val >= FBC_BATCH || val <= -FBC_BATCH) {
local_set(l, 0);
atomic_long_add(val, &fbc->count);
}
put_cpu();
}

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
long res = 0, val;
int cpu, idx;
struct cpu_offset *cp;
local_t *l;

spin_lock(&fbc->lock);
idx = fbc->offidx;
fbc->offidx ^= 1;
mb();
/*
 * FIXME :
 *  must 'wait' until other cpus no longer touch their old local_t
 */
for_each_cpu(cpu) {
cp = per_cpu_ptr(fbc->counters, cpu);
l = &cp->offset[idx];
val = local_read(l);
/* dont dirty alien cache line if not necessary */
if (val)
local_set(l, 0);
res += val;
}
spin_unlock(&fbc->lock);
atomic_long_add(res, &fbc->count);
return atomic_long_read(&fbc->count);
}
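
One hedged way to implement the 'wait' in the FIXME, assuming updaters keep
preemption disabled between get_cpu() and put_cpu(), would be a scheduler
grace period; note this sleeps, so fbc->lock would have to become a mutex:

	/* Sketch only: after flipping offidx, wait until every cpu has
	 * passed a quiescent point, so no updater can still hold a pointer
	 * to its old slot. */
	fbc->offidx ^= 1;
	smp_mb();		/* publish the flip before waiting */
	synchronize_sched();	/* all pre-flip updaters have finished */
	/* the old slots are now stable and can be summed and cleared */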



3) Are the locked ops so expensive if done on a cache line that is mostly in 
exclusive state in cpu cache ?


Thank you
Eric



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-28 Thread Andi Kleen
[adding linux-arch]

On Sunday 29 January 2006 01:55, Andrew Morton wrote:
> Benjamin LaHaise <[EMAIL PROTECTED]> wrote:
> > On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> > > We might use atomic_long_t only (and no spinlocks)
> > > Something like this ?
> >
> > Erk, complex and slow...  Try using local_t instead, which is
> > substantially cheaper on the P4 as it doesn't use the lock prefix and act
> > as a memory barrier.  See asm/local.h.
>
> local_t isn't much use until we get rid of asm-generic/local.h.  Bloaty,
> racy with nested interrupts.

It is just implemented wrong. It should use 
local_irq_save()/local_irq_restore() instead. But my bigger problem
with local_t is the few architectures (IA64, PPC64) that implement it 
with atomic_t. This means we can't replace local statistics counters
with local_t, because it would be a regression for them. I haven't done the
benchmarks yet, but I suspect both IA64 and PPC64 really should
just turn off interrupts.
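
A minimal sketch of the irq-disabling variant described above, assuming the
generic fallback only needs to be correct rather than fast (this is not the
actual asm-generic/local.h code):

	typedef struct { unsigned long v; } local_t;

	static inline void local_add(long i, local_t *l)
	{
		unsigned long flags;

		local_irq_save(flags);	/* closes the nested-interrupt race */
		l->v += i;
		local_irq_restore(flags);
	}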

-Andi


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-28 Thread Kyle McMartin
On Sat, Jan 28, 2006 at 08:19:44PM -0500, Benjamin LaHaise wrote:
> The overuse of atomics is horrific in what is being proposed.  All the
> major architectures except powerpc (i386, x86-64, ia64, and sparc64) 
> implement local_t.  It would make far more sense to push the last few 
> stragglers (which mostly seem to be uniprocessor) into writing the 
> appropriate implementations.  Perhaps it's time to add a #error in 
> asm-generic/local.h?
>

Surely asm-generic/local.h could now be reimplemented using atomic_long_t
to kill that aberration that is the BITS_PER_LONG != 32 case currently...? 
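
A sketch of that reimplementation, assuming only the atomic_long_t operations
discussed elsewhere in this thread (not actual kernel code):

	/* One local_t for 32-bit and 64-bit alike, at the cost of full
	 * atomics on architectures without a native local.h. */
	typedef struct { atomic_long_t a; } local_t;

	#define local_read(l)	atomic_long_read(&(l)->a)
	#define local_set(l, i)	atomic_long_set(&(l)->a, (i))
	#define local_add(i, l)	atomic_long_add((i), &(l)->a)
	#define local_sub(i, l)	atomic_long_sub((i), &(l)->a)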


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-28 Thread Andrew Morton
Benjamin LaHaise <[EMAIL PROTECTED]> wrote:
>
> On Sat, Jan 28, 2006 at 04:55:49PM -0800, Andrew Morton wrote:
> > local_t isn't much use until we get rid of asm-generic/local.h.  Bloaty,
> > racy with nested interrupts.
> 
> The overuse of atomics is horrific in what is being proposed.

Well yeah, it wasn't really a very serious proposal.  There's some
significant work yet to be done on this stuff.

> Perhaps it's time to add a #error in asm-generic/local.h?

heh, or #warning "you suck" or something.


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-28 Thread Benjamin LaHaise
On Sat, Jan 28, 2006 at 04:55:49PM -0800, Andrew Morton wrote:
> local_t isn't much use until we get rid of asm-generic/local.h.  Bloaty,
> racy with nested interrupts.

The overuse of atomics is horrific in what is being proposed.  All the
major architectures except powerpc (i386, x86-64, ia64, and sparc64) 
implement local_t.  It would make far more sense to push the last few 
stragglers (which mostly seem to be uniprocessor) into writing the 
appropriate implementations.  Perhaps it's time to add a #error in 
asm-generic/local.h?

-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <[EMAIL PROTECTED]>.


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-28 Thread Andrew Morton
Benjamin LaHaise <[EMAIL PROTECTED]> wrote:
>
> On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> > We might use atomic_long_t only (and no spinlocks)
> > Something like this ?
> 
> Erk, complex and slow...  Try using local_t instead, which is substantially 
> cheaper on the P4 as it doesn't use the lock prefix and act as a memory 
> barrier.  See asm/local.h.
> 

local_t isn't much use until we get rid of asm-generic/local.h.  Bloaty,
racy with nested interrupts.



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-28 Thread Benjamin LaHaise
On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> We might use atomic_long_t only (and no spinlocks)
> Something like this ?

Erk, complex and slow...  Try using local_t instead, which is substantially 
cheaper on the P4 as it doesn't use the lock prefix and act as a memory 
barrier.  See asm/local.h.

-ben


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Eric Dumazet

Ravikiran G Thirumalai wrote:

On Sat, Jan 28, 2006 at 01:35:03AM +0100, Eric Dumazet wrote:

Eric Dumazet wrote:

Andrew Morton wrote:

Eric Dumazet <[EMAIL PROTECTED]> wrote:


long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
long res = 0;
int cpu;
atomic_long_t *pcount;

for_each_cpu(cpu) {
pcount = per_cpu_ptr(fbc->counters, cpu);
/* dont dirty cache line if not necessary */
if (atomic_long_read(pcount))
res += atomic_long_xchg(pcount, 0);

--->  (A)

}



atomic_long_add(res, &fbc->count);

--->  (B)

res = atomic_long_read(&fbc->count);


return res;
}


The read is still theoretically FBC_BATCH * NR_CPUS inaccurate no?
What happens when other cpus update  their local counters at (A) and (B)?


Well, after atomic_read(&some_atomic) or percpu_counter_read_accurate(&s) the 
'value' you got may be inaccurate by whatever happened meanwhile... You cannot 
'freeze' the atomic / percpu_counter and get its accurate value.  All you can 
do is atomically add/remove some value from this counter (atomic or percpu_counter).


See percpu_counter_read_accurate() as an attempt to collect all local offsets 
(and atomically set them to 0) and atomically readjust the central fbc->count 
with the sum of all these offsets.  Kind of a consolidation that might be 
driven by a user request (read /proc/sys/...) or a periodic timer.






(I am hoping we don't need percpu_counter_read_accurate anywhere yet and
this is just demo ;).  I certainly don't want to go on all cpus for read / 
add cmpxchg on the write path for the proto counters that started this 
discussion)


As pointed out by Andrew (if you decode his short sentences carefully :)), all 
you really need is atomic_long_xchg() (only in the slow path), and 
atomic_long_{read,add} in the fast path.



Eric
struct percpu_counter {
atomic_long_t count;
atomic_long_t *counters;
};

#ifdef CONFIG_SMP
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
long new;
atomic_long_t *pcount;

pcount = per_cpu_ptr(fbc->counters, get_cpu());

new = atomic_long_read(pcount) + amount;
 
if (new >= FBC_BATCH || new <= -FBC_BATCH) {
new = atomic_long_xchg(pcount, 0) + amount;
if (new)
atomic_long_add(new, &fbc->count);
} else
atomic_long_add(amount, pcount);

put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
long res = 0;
int cpu;
atomic_long_t *pcount;

for_each_cpu(cpu) {
pcount = per_cpu_ptr(fbc->counters, cpu);
/* dont dirty cache line if not necessary */
if (atomic_long_read(pcount))
res += atomic_long_xchg(pcount, 0);
}
atomic_long_add(res, &fbc->count);
return atomic_long_read(&fbc->count);
}
EXPORT_SYMBOL(percpu_counter_read_accurate);
#endif /* CONFIG_SMP */



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Ravikiran G Thirumalai
On Sat, Jan 28, 2006 at 01:35:03AM +0100, Eric Dumazet wrote:
> Eric Dumazet wrote:
> >Andrew Morton wrote:
> >>Eric Dumazet <[EMAIL PROTECTED]> wrote:
> >
> >#ifdef CONFIG_SMP
> >void percpu_counter_mod(struct percpu_counter *fbc, long amount)
> >{
> > long old, new;
> > atomic_long_t *pcount;
> >
> > pcount = per_cpu_ptr(fbc->counters, get_cpu());
> >start:
> > old = atomic_long_read(pcount);
> > new = old + amount;
> > if (new >= FBC_BATCH || new <= -FBC_BATCH) {
> > if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
> > goto start;
> > atomic_long_add(new, &fbc->count);
> > } else
> > atomic_long_add(amount, pcount);
> >
> > put_cpu();
> >}
> >EXPORT_SYMBOL(percpu_counter_mod);
> >
> >long percpu_counter_read_accurate(struct percpu_counter *fbc)
> >{
> > long res = 0;
> > int cpu;
> > atomic_long_t *pcount;
> >
> > for_each_cpu(cpu) {
> > pcount = per_cpu_ptr(fbc->counters, cpu);
> > /* dont dirty cache line if not necessary */
> > if (atomic_long_read(pcount))
> > res += atomic_long_xchg(pcount, 0);
--->  (A)
> > }
> 

>   atomic_long_add(res, &fbc->count);
--->  (B)
>   res = atomic_long_read(&fbc->count);
> 
> > return res;
> >}

The read is still theoretically FBC_BATCH * NR_CPUS inaccurate no?
What happens when other cpus update  their local counters at (A) and (B)?

(I am hoping we don't need percpu_counter_read_accurate anywhere yet and
this is just demo ;).  I certainly don't want to go on all cpus for read / 
add cmpxchg on the write path for the proto counters that started this 
discussion)

Thanks,
Kiran


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Andrew Morton
Eric Dumazet <[EMAIL PROTECTED]> wrote:
>
> [PATCH] Add atomic_long_xchg() and atomic_long_cmpxchg() wrappers
> 
> ...
>  
> +static inline long atomic_long_xchg(atomic_long_t *l, long val)
> +{
> + atomic64_t *v = (atomic64_t *)l;
> + return atomic64_xchg(v, val);

All we need now is some implementations of atomic64_xchg()  ;)


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Eric Dumazet

Andrew Morton wrote:
> Eric Dumazet <[EMAIL PROTECTED]> wrote:
> > > An advantage of retaining a spinlock in percpu_counter is that if accuracy
> > > is needed at a low rate (say, /proc reading) we can take the lock and then
> > > go spill each CPU's local count into the main one.  It would need to be a
> > > very low rate though.   Or we make the cpu-local counters atomic too.
> > 
> > We might use atomic_long_t only (and no spinlocks)
> 
> Yup, that's it.
> 
> > Something like this ?
> 
> It'd be a lot neater if we had atomic_long_xchg().

You are my guest :)

[PATCH] Add atomic_long_xchg() and atomic_long_cmpxchg() wrappers

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>


--- a/include/asm-generic/atomic.h  2006-01-28 02:59:49.0 +0100
+++ b/include/asm-generic/atomic.h  2006-01-28 02:57:36.0 +0100
@@ -66,6 +66,18 @@
atomic64_sub(i, v);
 }
 
+static inline long atomic_long_xchg(atomic_long_t *l, long val)
+{
+   atomic64_t *v = (atomic64_t *)l;
+   return atomic64_xchg(v, val);
+}
+
+static inline long atomic_long_cmpxchg(atomic_long_t *l, long old, long new)
+{
+   atomic64_t *v = (atomic64_t *)l;
+   return atomic64_cmpxchg(v, old, new);
+}
+
 #else
 
 typedef atomic_t atomic_long_t;
@@ -113,5 +125,17 @@
atomic_sub(i, v);
 }
 
+static inline long atomic_long_xchg(atomic_long_t *l, long val)
+{
+   atomic_t *v = (atomic_t *)l;
+   return atomic_xchg(v, val);
+}
+
+static inline long atomic_long_cmpxchg(atomic_long_t *l, long old, long new)
+{
+   atomic_t *v = (atomic_t *)l;
+   return atomic_cmpxchg(v, old, new);
+}
+
 #endif
 #endif


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Andrew Morton
Eric Dumazet <[EMAIL PROTECTED]> wrote:
>
> > 
> > An advantage of retaining a spinlock in percpu_counter is that if accuracy
> > is needed at a low rate (say, /proc reading) we can take the lock and then
> > go spill each CPU's local count into the main one.  It would need to be a
> > very low rate though.   Or we make the cpu-local counters atomic too.
> 
> We might use atomic_long_t only (and no spinlocks)

Yup, that's it.

> Something like this ?
> 

It'd be a lot neater if we had atomic_long_xchg().



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Ravikiran G Thirumalai
On Sat, Jan 28, 2006 at 12:21:07AM +0100, Eric Dumazet wrote:
> Ravikiran G Thirumalai wrote:
> >On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
> 
> Why not use a boot time allocated percpu area (as done today in 
> setup_per_cpu_areas()), but instead of reserving extra space for module's 
> percpu data, being able to serve alloc_percpu() from this reserved area (ie 
> no kmalloced data anymore), and keeping your

At that time ia64 placed a limit on the max size of the per-CPU area 
(PERCPU_ENOUGH_ROOM).  I think that the limit is still there, but hopefully
64K per-CPU should be enough for static + dynamic + modules?

Let me do an allyesconfig on my box and verify.

Thanks,
Kiran


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Eric Dumazet

Eric Dumazet wrote:
> Andrew Morton wrote:
> > Eric Dumazet <[EMAIL PROTECTED]> wrote:
> > > Ravikiran G Thirumalai wrote:
> > > > On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> > > > > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > > > > > which can be assumed as not frequent.  At sk_stream_mem_schedule(),
> > > > > > read_sockets_allocated() is invoked only under certain conditions,
> > > > > > under memory pressure -- on a large CPU count machine, you'd have
> > > > > > large memory, and I don't think read_sockets_allocated would get
> > > > > > called often.  It did not at least on our 8cpu/16G box.  So this
> > > > > > should be OK I think.
> > > > > That being said, the percpu_counters aren't a terribly successful
> > > > > concept and probably do need a revisit due to the high inaccuracy at
> > > > > high CPU counts.  It might be better to do some generic version of
> > > > > vm_acct_memory() instead.
> > > > AFAICS vm_acct_memory is no better.  The deviation on large cpu counts
> > > > is the same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
> > > Ah... yes you are right, I read min(16, NR_CPUS*2)
> > So did I ;)
> > > I wonder if it is not a typo... I mean, I understand the more cpus you
> > > have, the fewer updates on the central atomic_t the better, but a
> > > quadratic offset seems too much...
> > I'm not sure whether it was a mistake or if I intended it and didn't do
> > the sums on accuracy :(
> > 
> > An advantage of retaining a spinlock in percpu_counter is that if accuracy
> > is needed at a low rate (say, /proc reading) we can take the lock and then
> > go spill each CPU's local count into the main one.  It would need to be a
> > very low rate though.   Or we make the cpu-local counters atomic too.
> 
> We might use atomic_long_t only (and no spinlocks)
> Something like this ?




struct percpu_counter {
atomic_long_t count;
atomic_long_t *counters;
};

#ifdef CONFIG_SMP
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
long old, new;
atomic_long_t *pcount;

pcount = per_cpu_ptr(fbc->counters, get_cpu());
start:
old = atomic_long_read(pcount);
new = old + amount;
if (new >= FBC_BATCH || new <= -FBC_BATCH) {
if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
goto start;
atomic_long_add(new, &fbc->count);
} else
atomic_long_add(amount, pcount);

put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
long res = 0;
int cpu;
atomic_long_t *pcount;

for_each_cpu(cpu) {
pcount = per_cpu_ptr(fbc->counters, cpu);
/* dont dirty cache line if not necessary */
if (atomic_long_read(pcount))
res += atomic_long_xchg(pcount, 0);
}


atomic_long_add(res, &fbc->count);
res = atomic_long_read(&fbc->count);


return res;
}
EXPORT_SYMBOL(percpu_counter_read_accurate);
#endif /* CONFIG_SMP */





Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Andrew Morton
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> > Andrew Morton <[EMAIL PROTECTED]> wrote:
> > >
> > > Oh, and because vm_acct_memory() is counting a singleton object, it can 
> > > use
> > > DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> > > overhead.
> > 
> > Actually, I don't think that's true.  we're allocating a sizeof(long) with
> > kmalloc_node() so there shouldn't be memory wastage.
> 
> Oh yeah there is.  Each dynamic per-cpu object would have been at least
> (NR_CPUS * sizeof (void *) + num_cpus_possible * cacheline_size).  
> Now kmalloc_node will fall back on size-32 for allocation of long, so
> replace the cacheline_size above with 32 -- which then means dynamic per-cpu
> data are not on a cacheline boundary anymore (most modern cpus have
> 64-byte/128-byte cache lines), which means per-cpu data could end up
> falsely shared
> 

OK.  But isn't the core of the problem the fact that __alloc_percpu() is
using kmalloc_node() rather than a (new, as-yet-unimplemented)
kmalloc_cpu()?  kmalloc_cpu() wouldn't need the L1 cache alignment.

It might be worth creating just a small number of per-cpu slabs (4-byte,
8-byte).  A kmalloc_cpu() would just need a per-cpu array of
kmem_cache_t*'s and it'd internally use kmalloc_node(cpu_to_node), no?

Or we could just give __alloc_percpu() a custom, hand-rolled,
not-cacheline-padded sizeof(long) slab per CPU and use that if (size ==
sizeof(long)).  Or something.
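
A hedged sketch of what that could look like; kmalloc_cpu() and the cache
array are hypothetical, not an existing API:

	/* Per-size caches without L1 padding; the allocation is directed at
	 * the cpu's home node so the object stays node-local. */
	static kmem_cache_t *cpu_slabs[2];	/* 4-byte and 8-byte objects */

	void *kmalloc_cpu(size_t size, gfp_t flags, int cpu)
	{
		kmem_cache_t *cachep = cpu_slabs[size > 4];

		return kmem_cache_alloc_node(cachep, flags, cpu_to_node(cpu));
	}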


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Eric Dumazet

Andrew Morton wrote:
> Eric Dumazet <[EMAIL PROTECTED]> wrote:
> > Ravikiran G Thirumalai wrote:
> > > On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> > > > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > > > > which can be assumed as not frequent.  At sk_stream_mem_schedule(),
> > > > > read_sockets_allocated() is invoked only under certain conditions,
> > > > > under memory pressure -- on a large CPU count machine, you'd have
> > > > > large memory, and I don't think read_sockets_allocated would get
> > > > > called often.  It did not at least on our 8cpu/16G box.  So this
> > > > > should be OK I think.
> > > > That being said, the percpu_counters aren't a terribly successful
> > > > concept and probably do need a revisit due to the high inaccuracy at
> > > > high CPU counts.  It might be better to do some generic version of
> > > > vm_acct_memory() instead.
> > > AFAICS vm_acct_memory is no better.  The deviation on large cpu counts
> > > is the same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
> > Ah... yes you are right, I read min(16, NR_CPUS*2)
> So did I ;)
> > I wonder if it is not a typo... I mean, I understand the more cpus you
> > have, the fewer updates on the central atomic_t the better, but a
> > quadratic offset seems too much...
> I'm not sure whether it was a mistake or if I intended it and didn't do the
> sums on accuracy :(
> 
> An advantage of retaining a spinlock in percpu_counter is that if accuracy
> is needed at a low rate (say, /proc reading) we can take the lock and then
> go spill each CPU's local count into the main one.  It would need to be a
> very low rate though.   Or we make the cpu-local counters atomic too.

We might use atomic_long_t only (and no spinlocks)
Something like this ?

struct percpu_counter {
atomic_long_t count;
atomic_long_t *counters;
};

#ifdef CONFIG_SMP
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
long old, new;
atomic_long_t *pcount;

pcount = per_cpu_ptr(fbc->counters, get_cpu());
start:
old = atomic_long_read(pcount);
new = old + amount;
if (new >= FBC_BATCH || new <= -FBC_BATCH) {
if (unlikely(atomic_long_cmpxchg(pcount, old, 0) != old))
goto start;
atomic_long_add(new, &fbc->count);
} else
atomic_long_add(amount, pcount);

put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);

long percpu_counter_read_accurate(struct percpu_counter *fbc)
{
long res = 0;
int cpu;
atomic_long_t *pcount;

for_each_cpu(cpu) {
pcount = per_cpu_ptr(fbc->counters, cpu);
/* dont dirty cache line if not necessary */
if (atomic_long_read(pcount))
res += atomic_long_xchg(pcount, 0);
}
return res;
}
EXPORT_SYMBOL(percpu_counter_read_accurate);
#endif /* CONFIG_SMP */



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Ravikiran G Thirumalai
On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> Andrew Morton <[EMAIL PROTECTED]> wrote:
> >
> > Oh, and because vm_acct_memory() is counting a singleton object, it can use
> > DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> > overhead.
> 
> Actually, I don't think that's true.  we're allocating a sizeof(long) with
> kmalloc_node() so there shouldn't be memory wastage.

Oh yeah there is.  Each dynamic per-cpu object would have been at least
(NR_CPUS * sizeof (void *) + num_cpus_possible * cacheline_size).  
Now kmalloc_node will fall back on size-32 for allocation of long, so
replace the cacheline_size above with 32 -- which then means dynamic per-cpu
data are not on a cacheline boundary anymore (most modern cpus have
64-byte/128-byte cache lines), which means per-cpu data could end up
falsely shared
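
To put hypothetical numbers on that: with NR_CPUS = 128 on a 64-bit kernel
with 8 possible cpus, that is 128 * 8 + 8 * 32 = 1280 bytes per counter for
64 bytes of useful data -- and two 32-byte objects can share one 64-byte
cache line, which is exactly the false sharing described above.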

Kiran


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Eric Dumazet

Ravikiran G Thirumalai wrote:

On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:

There are several issues here :

alloc_percpu() current implementation is a waste of ram (because it uses 
slab allocations that have a minimum size of 32 bytes)


Oh there was a solution for that :).  


http://lwn.net/Articles/119532/

I can quickly revive it if there is interest.



Well, nice work ! :)

Maybe a little bit complicated if the expected percpu space is 50 KB per cpu?

Why not use a boot time allocated percpu area (as done today in 
setup_per_cpu_areas()), but instead of reserving extra space for module's 
percpu data, being able to serve alloc_percpu() from this reserved area (ie no 
kmalloced data anymore), and keeping your


#define per_cpu_ptr(ptr, cpu)  ((__typeof__(ptr)) \
(RELOC_HIDE(ptr,  PCPU_BLKSIZE * cpu)))

Some code from kernel/module.c could be reworked to serve both as an allocator 
when a module percpudata must be relocated (insmod)/freed (rmmod), and serve 
alloc_percpu() for 'dynamic allocations'


Eric


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Andrew Morton
Eric Dumazet <[EMAIL PROTECTED]> wrote:
>
> Ravikiran G Thirumalai wrote:
> > On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> >> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >>> which can be assumed as not frequent.  
> >>> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only 
> >>> under certain conditions, under memory pressure -- on a large CPU count 
> >>> machine, you'd have large memory, and I don't think read_sockets_allocated 
> >>> would get called often.  It did not at least on our 8cpu/16G box.  So this 
> >>> should be OK I think.
> >> That being said, the percpu_counters aren't a terribly successful concept
> >> and probably do need a revisit due to the high inaccuracy at high CPU
> >> counts.  It might be better to do some generic version of vm_acct_memory()
> >> instead.
> > 
> > AFAICS vm_acct_memory is no better.  The deviation on large cpu counts is 
> > the 
> > same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...
> 
> Ah... yes you are right, I read min(16, NR_CPUS*2)

So did I ;)

> I wonder if it is not a typo... I mean, I understand the more cpus you have, 
> the fewer updates on the central atomic_t the better, but a quadratic offset 
> seems too much...

I'm not sure whether it was a mistake or if I intended it and didn't do the
sums on accuracy :(

An advantage of retaining a spinlock in percpu_counter is that if accuracy
is needed at a low rate (say, /proc reading) we can take the lock and then
go spill each CPU's local count into the main one.  It would need to be a
very low rate though.   Or we make the cpu-local counters atomic too.

Certainly it's sensible to delegate the tuning to the creator of the
percpu_counter, but it'll be a difficult thing to get right.


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Andrew Morton
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > >
> > > which can be assumed as not frequent.  
> > > At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only 
> > > under certain conditions, under memory pressure -- on a large CPU count 
> > > machine, you'd have large memory, and I don't think read_sockets_allocated 
> > > would get called often.  It did not at least on our 8cpu/16G box.  So this 
> > > should be OK I think.
> > 
> > That being said, the percpu_counters aren't a terribly successful concept
> > and probably do need a revisit due to the high inaccuracy at high CPU
> > counts.  It might be better to do some generic version of vm_acct_memory()
> > instead.
> 
> AFAICS vm_acct_memory is no better.  The deviation on large cpu counts is the 
> same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

I suppose so.  Except vm_acct_memory() has

#define ACCT_THRESHOLD  max(16, NR_CPUS * 2)

But if we were to perform similar tuning to percpu_counter, yes, they're
pretty similar.

Oh, and because vm_acct_memory() is counting a singleton object, it can use
DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
overhead.


> > 
> > If the benchmarks say that we need to.  If we cannot observe any problems
> > in testing of existing code and if we can't demonstrate any benefit from
> > the patched code then one option is to go off and do something else ;)
> 
> We first tried plain per-CPU counters for memory_allocated, found that reads
> on memory_allocated were causing cacheline transfers, and then
> switched over to batching.  So batching reads is useful.  To avoid
> inaccuracy, we can maybe change percpu_counter_init to:
> 
> void percpu_counter_init(struct percpu_counter *fbc, int maxdev)
> 
> the percpu batching limit would then be maxdev/num_possible_cpus.  One would
> use batching counters only when both reads and writes are frequent.  With
> the above scheme, we would go fetch cachelines from other cpus for reads
> only on large cpu counts, which is not any worse than the global
> counter alternative, but it would still be beneficial on smaller machines,
> without sacrificing a pre-set deviation.
> 
> Comments?

Sounds sane.


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Andrew Morton
Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> Oh, and because vm_acct_memory() is counting a singleton object, it can use
> DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> overhead.

Actually, I don't think that's true.  we're allocating a sizeof(long) with
kmalloc_node() so there shouldn't be memory wastage.



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Eric Dumazet

Ravikiran G Thirumalai wrote:

On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:

Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
which can be assumed as not frequent.  
At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only under 
certain conditions, under memory pressure -- on a large CPU count machine, 
you'd have large memory, and I don't think read_sockets_allocated would get 
called often.  It did not at least on our 8cpu/16G box.  So this should be OK 
I think.

That being said, the percpu_counters aren't a terribly successful concept
and probably do need a revisit due to the high inaccuracy at high CPU
counts.  It might be better to do some generic version of vm_acct_memory()
instead.


AFAICS vm_acct_memory is no better.  The deviation on large cpu counts is the 
same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...


Ah... yes you are right, I read min(16, NR_CPUS*2)

I wonder if it is not a typo... I mean, I understand the more cpus you have, 
the fewer updates on the central atomic_t the better, but a quadratic offset 
seems too much...


Eric



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Ravikiran G Thirumalai
On Fri, Jan 27, 2006 at 11:30:23PM +0100, Eric Dumazet wrote:
> 
> There are several issues here :
> 
> alloc_percpu() current implementation is a waste of ram (because it uses 
> slab allocations that have a minimum size of 32 bytes)

Oh there was a solution for that :).  

http://lwn.net/Articles/119532/

I can quickly revive it if there is interest.


Thanks,
Kiran


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Ravikiran G Thirumalai
On Fri, Jan 27, 2006 at 12:16:02PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > which can be assumed as not frequent.  
> > At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only 
> > under certain conditions, under memory pressure -- on a large CPU count machine, 
> > you'd have large memory, and I don't think read_sockets_allocated would get 
> > called often.  It did not at least on our 8cpu/16G box.  So this should be 
> > OK I think.
> 
> That being said, the percpu_counters aren't a terribly successful concept
> and probably do need a revisit due to the high inaccuracy at high CPU
> counts.  It might be better to do some generic version of vm_acct_memory()
> instead.

AFAICS vm_acct_memory is no better.  The deviation on large cpu counts is the 
same as percpu_counters -- (NR_CPUS * NR_CPUS * 2) ...

> 
> If the benchmarks say that we need to.  If we cannot observe any problems
> in testing of existing code and if we can't demonstrate any benefit from
> the patched code then one option is to go off and do something else ;)

We first tried plain per-CPU counters for memory_allocated, found that reads
on memory_allocated were causing cacheline transfers, and then
switched over to batching.  So batching reads is useful.  To avoid
inaccuracy, we can maybe change percpu_counter_init to:

void percpu_counter_init(struct percpu_counter *fbc, int maxdev)

the percpu batching limit would then be maxdev/num_possible_cpus.  One would
use batching counters only when both reads and writes are frequent.  With
the above scheme, we would go fetch cachelines from other cpus for reads
only on large cpu counts, which is not any worse than the global
counter alternative, but it would still be beneficial on smaller machines,
without sacrificing a pre-set deviation.
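
For illustration, a sketch of a call site under the proposed interface,
assuming the proto counters were switched to struct percpu_counter (the
maxdev value is a made-up example):

	/* Tolerate a total deviation of at most 256 sockets; the per-cpu
	 * batch becomes 256/num_possible_cpus(), i.e. 16 on a 16-way box
	 * and 2 on a 128-way box. */
	percpu_counter_init(&tcp_sockets_allocated, 256);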

Comments?

Thanks,
Kiran


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Eric Dumazet

Andrew Morton wrote:
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
> > > Ravikiran G Thirumalai wrote:
> > > > Change the atomic_t sockets_allocated member of struct proto to a 
> > > > per-cpu counter.
> > > >
> > > > Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
> > > > Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
> > > > Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
> > >
> > > Hi Ravikiran
> > >
> > > If I correctly read this patch, I think there is a scalability problem.
> > >
> > > On a big SMP machine, read_sockets_allocated() is going to be a real killer.
> > >
> > > Say we have 128 Opterons CPUS in a box.
> >
> > read_sockets_allocated is being invoked when /proc/net/protocols is read,
> > which can be assumed as not frequent.  At sk_stream_mem_schedule(),
> > read_sockets_allocated() is invoked only under certain conditions, under
> > memory pressure -- on a large CPU count machine, you'd have large memory,
> > and I don't think read_sockets_allocated would get called often.  It did
> > not at least on our 8cpu/16G box.  So this should be OK I think.
>
> That being said, the percpu_counters aren't a terribly successful concept
> and probably do need a revisit due to the high inaccuracy at high CPU
> counts.  It might be better to do some generic version of vm_acct_memory()
> instead.


There are several issues here :

alloc_percpu() current implementation is a waste of ram (because it uses 
slab allocations that have a minimum size of 32 bytes)


Currently we cannot use per_cpu(&some_object, cpu), so a generic version of 
vm_acct_memory() would need a rework of percpu.h and maybe this is not 
possible on every platform ?


#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))

-->

#define per_cpu_name(var) per_cpu__##var
#define per_cpu_addr(var) &per_cpu_name(var)
#define per_cpu(var, cpu) (*RELOC_HIDE(per_cpu_addr(var), __per_cpu_offset[cpu]))


But this could render TLS migration difficult...

Eric


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Andrew Morton
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
> > Ravikiran G Thirumalai wrote:
> > >Change the atomic_t sockets_allocated member of struct proto to a 
> > >per-cpu counter.
> > >
> > >Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
> > >Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
> > >Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
> > >
> > Hi Ravikiran
> > 
> > If I correctly read this patch, I think there is a scalability problem.
> > 
> > On a big SMP machine, read_sockets_allocated() is going to be a real killer.
> > 
> > Say we have 128 Opterons CPUS in a box.
> 
> read_sockets_allocated is being invoked when /proc/net/protocols is read,
> which can be assumed as not frequent.  
> At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only 
> under certain conditions, under memory pressure -- on a large CPU count machine, 
> you'd have large memory, and I don't think read_sockets_allocated would get 
> called often.  It did not at least on our 8cpu/16G box.  So this should be OK 
> I think.

That being said, the percpu_counters aren't a terribly successful concept
and probably do need a revisit due to the high inaccuracy at high CPU
counts.  It might be better to do some generic version of vm_acct_memory()
instead.

If the benchmarks say that we need to.  If we cannot observe any problems
in testing of existing code and if we can't demonstrate any benefit from
the patched code then one option is to go off and do something else ;)



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Ravikiran G Thirumalai
On Fri, Jan 27, 2006 at 09:53:53AM +0100, Eric Dumazet wrote:
> Ravikiran G Thirumalai wrote:
> >Change the atomic_t sockets_allocated member of struct proto to a 
> >per-cpu counter.
> >
> >Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
> >Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
> >Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>
> >
> Hi Ravikiran
> 
> If I correctly read this patch, I think there is a scalability problem.
> 
> On a big SMP machine, read_sockets_allocated() is going to be a real killer.
> 
> Say we have 128 Opterons CPUS in a box.

read_sockets_allocated is being invoked when /proc/net/protocols is read,
which can be assumed as not frequent.  
At sk_stream_mem_schedule(), read_sockets_allocated() is invoked only 
under certain conditions, under memory pressure -- on a large CPU count machine, 
you'd have large memory, and I don't think read_sockets_allocated would get 
called often.  It did not at least on our 8cpu/16G box.  So this should be OK 
I think.

There're no 128 CPU Opteron boxes yet afaik ;).

Thanks,
Kiran


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-27 Thread Eric Dumazet

Ravikiran G Thirumalai wrote:
Change the atomic_t sockets_allocated member of struct proto to a 
per-cpu counter.


Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>


Hi Ravikiran

If I correctly read this patch, I think there is a scalability problem.

On a big SMP machine, read_sockets_allocated() is going to be a real killer.

Say we have 128 Opterons CPUS in a box.

You'll need to bring 128 cache lines (plus 8*128 bytes to read the 128 
pointers inside percpu structure)


I think a solution 'a la percpu_counter' is preferable, or even better a 
dedicated per_cpu with threshold management (see mm/swap.c, function 
vm_acct_memory(), to see how vm_committed_space is updated without too bad SMP 
scalability)
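
For reference, the vm_acct_memory() scheme boils down to something like the
following paraphrase of mm/swap.c (not a verbatim copy):

	/* Accumulate in a per-cpu long and fold into the global atomic only
	 * when the local delta exceeds a threshold, so most updates touch
	 * no shared cache line. */
	#define ACCT_THRESHOLD	max(16, NR_CPUS * 2)

	void vm_acct_memory(long pages)
	{
		long *local = &get_cpu_var(committed_space);

		*local += pages;
		if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
			atomic_add(*local, &vm_committed_space);
			*local = 0;
		}
		put_cpu_var(committed_space);
	}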


Thank you

Eric



[patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-26 Thread Ravikiran G Thirumalai
Change the atomic_t sockets_allocated member of struct proto to a 
per-cpu counter.

Signed-off-by: Pravin B. Shelar <[EMAIL PROTECTED]>
Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.16-rc1/include/net/sock.h
===================================================================
--- linux-2.6.16-rc1.orig/include/net/sock.h	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/include/net/sock.h	2006-01-24 17:35:34.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
/* Memory pressure */
	void			(*enter_memory_pressure)(void);
	struct percpu_counter	*memory_allocated;	/* Current allocated memory. */
-	atomic_t		*sockets_allocated;	/* Current number of sockets. */
+	int			*sockets_allocated;	/* Current number of sockets (percpu counter). */
 
/*
 * Pressure flag: try to collapse.
@@ -579,6 +579,24 @@ struct proto {
} stats[NR_CPUS];
 };
 
+static inline int read_sockets_allocated(struct proto *prot)
+{
+   int total = 0;
+   int cpu;
+   for_each_cpu(cpu)
+   total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+   return total;
+}
+
+static inline void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+   (*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+   put_cpu();
+}
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
 extern int proto_register(struct proto *prot, int alloc_slab);
 extern void proto_unregister(struct proto *prot);
 
Index: linux-2.6.16-rc1/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc1.orig/include/net/tcp.h	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/include/net/tcp.h	2006-01-24 17:05:34.000000000 -0800
@@ -221,7 +221,7 @@ extern int sysctl_tcp_tso_win_divisor;
 extern int sysctl_tcp_abc;
 
 extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
 /*
Index: linux-2.6.16-rc1/net/core/sock.c
===================================================================
--- linux-2.6.16-rc1.orig/net/core/sock.c	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/core/sock.c	2006-01-24 16:47:55.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock 
newsk->sk_sleep  = NULL;
 
if (newsk->sk_prot->sockets_allocated)
-   atomic_inc(newsk->sk_prot->sockets_allocated);
+		inc_sockets_allocated(newsk->sk_prot->sockets_allocated);
}
 out:
return newsk;
@@ -1620,7 +1620,7 @@ static void proto_seq_printf(struct seq_
"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c 
%2c %2c %2c %2c %2c %2c\n",
   proto->name,
   proto->obj_size,
-  proto->sockets_allocated != NULL ? 
atomic_read(proto->sockets_allocated) : -1,
+  proto->sockets_allocated != NULL ? 
read_sockets_allocated(proto) : -1,
   proto->memory_allocated != NULL ?
percpu_counter_read(proto->memory_allocated) : 
-1,
   proto->memory_pressure != NULL ? *proto->memory_pressure ? 
"yes" : "no" : "NI",
Index: linux-2.6.16-rc1/net/core/stream.c
===================================================================
--- linux-2.6.16-rc1.orig/net/core/stream.c	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/core/stream.c	2006-01-24 16:20:22.000000000 -0800
@@ -243,7 +243,7 @@ int sk_stream_mem_schedule(struct sock *
return 1;
 
	if (!*sk->sk_prot->memory_pressure ||
-	    sk->sk_prot->sysctl_mem[2] > atomic_read(sk->sk_prot->sockets_allocated) *
+	    sk->sk_prot->sysctl_mem[2] > read_sockets_allocated(sk->sk_prot) *
	    sk_stream_pages(sk->sk_wmem_queued +
			    atomic_read(&sk->sk_rmem_alloc) +
			    sk->sk_forward_alloc))
Index: linux-2.6.16-rc1/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc1.orig/net/ipv4/proc.c	2006-01-24 14:37:45.000000000 -0800
+++ linux-2.6.16-rc1/net/ipv4/proc.c	2006-01-24 16:20:22.000000000 -0800
@@ -63,7 +63,7 @@ static int sockstat_seq_show(struct seq_
socket_seq_show(seq);
seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
   fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
-  tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
+  tcp_death_row.tw_count, read_sockets_allocated(&tcp_prot),
   percpu_co