Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Gustaf Neumann

Hi Jeff,

we are aware that the function is essentially an integer log2.
The chosen C-based variant is actually faster and more general than
the one you included (it needs at most 2 shift operations for
the relevant range), but the assembler-based variant is hard to beat
and yields another 3% of benchmark performance
on top of the fastest C version. Thanks for that!

-gustaf

Jeff Rogers wrote:

I don't think anyone has pointed this out yet, but this is a logarithm
in base 2 (log2), and there are a fair number of implementations of this 
available; for maximum performance there are assembly implementations 
using 'bsr' on x86 architectures, such as this one from google's tcmalloc:


  





Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Jeff Rogers

Gustaf Neumann wrote:

This is most probably the best variant so far, and not complicated, so an
optimizer can do "the right thing" easily. Sorry for the many versions...


-gustaf


{ unsigned register int s = (size-1) >> 3;
  while (s>1) { s >>= 1; bucket++; }
}

  if (bucket > NBUCKETS) {
bucket = NBUCKETS;
  }


I don't think anyone has pointed this out yet, but this is a logarithm 
in base 2 (log2), and there are a fair number of implementations of this 
available; for maximum performance there are assembly implementations 
using 'bsr' on x86 architectures, such as this one from google's tcmalloc:


// Return floor(log2(n)) for n > 0.
#if (defined __i386__ || defined __x86_64__) && defined __GNUC__
static inline int LgFloor(size_t n) {
  // "ro" for the input spec means the input can come from either a
  // register ("r") or offsetable memory ("o").
  size_t result;
  __asm__("bsr  %1, %0"
  : "=r" (result)   // Output spec
  : "ro" (n)// Input spec
  : "cc"// Clobbers condition-codes
  );
  return result;
}
#else
// Note: the following only works for "n"s that fit in 32-bits, but
// that is fine since we only use it for small sizes.
static inline int LgFloor(size_t n) {
  int log = 0;
  for (int i = 4; i >= 0; --i) {
int shift = (1 << i);
size_t x = n >> shift;
if (x != 0) {
  n = x;
  log += shift;
}
  }
  ASSERT(n == 1);
  return log;
}
#endif

(Disclaimer - this comment is based on my explorations of zippy, not vt, 
so the logic may be entirely different.)  If this log2(requested_size) is 
used to index directly into the bucket table, that necessarily restricts 
you to power-of-2 bucket sizes, meaning you allocate on average nearly 50% 
more than requested (i.e., nearly 33% of allocated memory is 
overhead/wasted).  Adding more, closer-spaced buckets increases the base 
footprint but can reduce peak usage by cutting that wasted space.  I 
believe tcmalloc uses buckets spaced so that the average waste is only 12.5%.
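To make the bucket-spacing trade-off concrete, here is a minimal stand-alone sketch (illustrative only: pow2_class() and finer_class() are hypothetical helpers, not tcmalloc's or Zippy's actual size-class code). pow2_class() rounds a request up to the power-of-2 bucket that a plain log2 index implies; finer_class() rounds up to one of four size classes per power of two, which shrinks the rounding loss at the cost of a larger bucket table:

#include <stdio.h>

/* Smallest power of two >= n (assumes n > 0). */
static size_t pow2_class(size_t n)
{
    size_t c = 1;

    while (c < n) {
        c <<= 1;
    }
    return c;
}

/* Round n up to one of four classes per power of two, e.g.
 * 1280, 1536, 1792, 2048 between 1024 and 2048 (hypothetical spacing). */
static size_t finer_class(size_t n)
{
    size_t base = pow2_class(n);
    size_t step = base / 8;

    if (step == 0 || n > base - step) {
        return base;
    }
    return base - step * ((base - n) / step);
}

int main(void)
{
    size_t sizes[] = { 9, 100, 1025, 5000, 16000 };
    size_t i;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        printf("request %5zu -> pow2 class %5zu, finer class %5zu\n",
               sizes[i], pow2_class(sizes[i]), finer_class(sizes[i]));
    }
    return 0;
}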


-J



Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Zoran Vasiljevic


On 16.01.2007 at 15:52, Zoran Vasiljevic wrote:



You see, even we (i.e. Mike) noticed one glitch in the
test program that made Zippy look ridiculous on the Mac,
although it wasn't.


Hmhmhmh... I must have done something very wrong :-(

When I now repeat the tests on Mac/Zippy, even with the
size limited to 16000 bytes, it still performs miserably.

For just one thread, it gives "decent" values
(although still 2.5 times slower than VT).
For two threads, it goes down to about 1/5th
and so on...

I have asked Gustaf to try to reproduce that
on his Mac, as I am slowly starting to see white mice
(no, I never drink _any_ alcohol)...

If Gustaf confirms my findings, then we are still
back where we were with Zippy.

And yes, I have disabled that block splitting Mike
was talking about in his email. So it is not that.
And... it is not the size of the allocations (> 16284),
as I fixed that as well...

Background: I wanted to update the README file with
new performance values and found out that Zippy's performance
hasn't changed, although I thought it was fixed by
that size change... Hmmm...

Zoran




Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Vlad Seryakov
Yes, it is a combined version, but the Tcl version is slightly different and
Zoran took it over to maintain. In my tarball I include both; we do
experiments in different directions and then combine the best results.

The intention was also to try to include it in Tcl itself.

Stephen Deasey wrote:

On 1/16/07, Stephen Deasey <[EMAIL PROTECTED]> wrote:

On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:

On 16.01.2007 at 12:18, Stephen Deasey wrote:


  vtmalloc  <-- add this

It's there. Everybody can now contribute, if needed.


Rocking.

I suggest putting the 0.0.3 tarball up on sourceforge, announcing on
Freshmeat, and cross-posting on the aolserver list.  You really want
random people with their random workloads on random OS to beat on
this.  I don't know if the pool of people here is large enough for
that...

I'm sure there's a lot of other people who would be interested in
this, if they knew about it.  Should probably cross-post here, for
example:

http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?



Vlad's already on the ball...

http://freshmeat.net/projects/vtmalloc/




--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Zoran Vasiljevic


On 16.01.2007 at 15:41, Stephen Deasey wrote:



I suggest putting the 0.0.3 tarball up on sourceforge, announcing on
Freshmeat, and cross-posting on the aolserver list.  You really want
random people with their random workloads on random OS to beat on
this.  I don't know if the pool of people here is large enough for
that...

I'm sure there's a lot of other people who would be interested in
this, if they knew about it.  Should probably cross-post here, for
example:

http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?


The plan was to beat this beast first in the "family",
then go to the next village (aol-list) and then visit
the next town (tcl-core list), in that sequence.

You see, even we (i.e. Mike) noticed one glitch in the
test program that made Zippy look ridiculous on the Mac,
although it wasn't. So we now have enough experience
to go visit our neighbours and see what they'll say.
Given positive feedback, the next stop is the Tcl core list. There
I expect the fiercest opposition to any change (which
is understandable, given the size of the group of
involved people and the kind of change).

Cheers
Zoran
 





Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Stephen Deasey

On 1/16/07, Stephen Deasey <[EMAIL PROTECTED]> wrote:

On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:
>
> On 16.01.2007 at 12:18, Stephen Deasey wrote:
>
> >   vtmalloc  <-- add this
>
> It's there. Everybody can now contribute, if needed.
>

Rocking.

I suggest putting the 0.0.3 tarball up on sourceforge, announcing on
Freshmeat, and cross-posting on the aolserver list.  You really want
random people with their random workloads on random OS to beat on
this.  I don't know if the pool of people here is large enough for
that...

I'm sure there's a lot of other people who would be interested in
this, if they knew about it.  Should probably cross-post here, for
example:

http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?



Vlad's already on the ball...

   http://freshmeat.net/projects/vtmalloc/



Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Stephen Deasey

On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


On 16.01.2007 at 12:18, Stephen Deasey wrote:

>   vtmalloc  <-- add this

It's there. Everybody can now contribute, if needed.



Rocking.

I suggest putting the 0.0.3 tarball up on sourceforge, announcing on
Freshmeat, and cross-posting on the aolserver list.  You really want
random people with their random workloads on random OS to beat on
this.  I don't know if the pool of people here is large enough for
that...

I'm sure there's a lot of other people who would be interested in
this, if they knew about it.  Should probably cross-post here, for
example:

   http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?



Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Zoran Vasiljevic


On 16.01.2007 at 12:18, Stephen Deasey wrote:


  vtmalloc  <-- add this


It's there. Everybody can now contribute, if needed.





Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Stephen Deasey

On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


On 16.01.2007 at 10:37, Stephen Deasey wrote:

>
> Can you import this into CVS?  Top level.
>

You mean the tclThreadAlloc.c file on top-level
of the naviserver project?




The whole thing: README, licence, tests etc.  By top level, I just
mean not in the modules directory, because it isn't one.  So, CVS:

 naviserver
 modules
 website
 vtmalloc  <-- add this

Unless you're planning to push this upstream in the next week or so.
Or you really want to host this on your own website.

It's a shame to have good work hidden in random places.



Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Gustaf Neumann

Zoran Vasiljevic wrote:

Guess what: it is _slower_ now than the

    s = (size-1) >> 3;
    while (s>1) {s >>= 1; bucket++;}

I tend to like that one as it is really neat.
It will also better illustrate what is being
done.
  

This is the last one for today. It is the unrolled variant, with fewer tests,
and still human readable. It should be faster than the plain
while variants.

-gustaf

  { unsigned register int s = (size-1) >> 4;

    while (s >= 0x1000) {
        s >>= 12;
        bucket += 12;
    }
    if (s >= 0x0800) { s >>= 11; bucket += 11; } else
    if (s >= 0x0400) { s >>= 10; bucket += 10; } else
    if (s >= 0x0200) { s >>= 9;  bucket += 9;  } else
    if (s >= 0x0100) { s >>= 8;  bucket += 8;  } else
    if (s >= 0x0080) { s >>= 7;  bucket += 7;  } else
    if (s >= 0x0040) { s >>= 6;  bucket += 6;  } else
    if (s >= 0x0020) { s >>= 5;  bucket += 5;  } else
    if (s >= 0x0010) { s >>= 4;  bucket += 4;  } else
    if (s >= 0x0008) { s >>= 3;  bucket += 3;  } else
    if (s >= 0x0004) { s >>= 2;  bucket += 2;  } else
    if (s >= 0x0002) { s >>= 1;  bucket += 1;  }

    if (s >= 1) {
        bucket++;
    }

    if (bucket > NBUCKETS) {
        bucket = NBUCKETS;
    }
  }



Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Zoran Vasiljevic


On 16.01.2007 at 10:37, Stephen Deasey wrote:



Can you import this into CVS?  Top level.



You mean the tclThreadAlloc.c file on top-level
of the naviserver project?






Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Zoran Vasiljevic


On 16.01.2007 at 11:24, Gustaf Neumann wrote:


If all cases are used, all loops but the first execute
at most once and could be changed into ifs... I will send
you a separate mail with such a variant, but I am
currently running out of battery.



Guess what: it is _slower_ now than the

  s = (size-1) >> 3;
  while (s>1) {s >>= 1; bucket++;}

I tend to like that one as it is really neat.
It will also better illustrate what is being
done.

Mind you: _slower_ means about 1-2%, so I do not
believe we need to improve on that any more.
The above version is, I believe, the most "opportune" one,
as it is readable (thus understandable) and
very fast.




Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Gustaf Neumann

Zoran Vasiljevic wrote:


On 16.01.2007 at 10:46, Gustaf Neumann wrote:

This is most probably the best variant so far, and not complicated, so an
optimizer can do "the right thing" easily. Sorry for the many versions...
-gustaf


   { unsigned register int s = (size-1) >> 3;
 while (s>1) { s >>= 1; bucket++; }
   }

 if (bucket > NBUCKETS) {
   bucket = NBUCKETS;
 }


You'd be surprised that this one

I am. That's the story of the unrolled loops.

Btw, the version you have listed as the fastest
has wrong boundary tests (but still gives the same
result).

Below is the corrected version, which needs at most
2 shift operations for sizes up to one million.

The nice thing about this code (due to the staggered whiles)
is that any of the while loops (except the last)
can be removed and the code still works correctly
(it just needs more shift operations). That's the
reason why yesterday's version actually works.

If all cases are used, all loops but the first execute
at most once and could be changed into ifs... I will send
you a separate mail with such a variant, but I am
currently running out of battery.

    while (s >= 0x1000) {
        s >>= 12;
        bucket += 12;
    }
    while (s >= 0x0800) {
        s >>= 11;
        bucket += 11;
    }
    while (s >= 0x0400) {
        s >>= 10;
        bucket += 10;
    }
    while (s >= 0x0200) {
        s >>= 9;
        bucket += 9;
    }
    while (s >= 0x0100) {
        s >>= 8;
        bucket += 8;
    }
    while (s >= 0x0080) {
        s >>= 7;
        bucket += 7;
    }
    while (s >= 0x0040) {
        s >>= 6;
        bucket += 6;
    }
    while (s >= 0x0020) {
        s >>= 5;
        bucket += 5;
    }
    while (s >= 0x0010) {
        s >>= 4;
        bucket += 4;
    }
    while (s >= 0x0008) {
        s >>= 3;
        bucket += 3;
    }
    while (s >= 0x0004) {
        s >>= 2;
        bucket += 2;
    }
    while (s >= 1) {
        s >>= 1;
        bucket++;
    }

    if (bucket > NBUCKETS) {
        bucket = NBUCKETS;
    }







Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 10098495 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)


whereas this one:

 s = (size-1) >> 3;
 while (s>1) { s >>= 1; bucket++;}

gives:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 9720847 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

That is ((10098495-9720847)/10098495)*100 = 3.7% less.

That is all measured on Linux. I haven't done it on the Mac
and on the Sun yet. I now have all versions in place and will
play a little on each platform to see which one operates best
overall. The latest one is more appealing because of the
simplicity of the code, so we can turn a blind eye to those few
percent, I guess.

Cheers
Zoran





Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Zoran Vasiljevic


On 16.01.2007 at 10:46, Gustaf Neumann wrote:


s = (size-1) >> 3;
while (s>1) { s >>= 1; bucket++; }


On Linux and Solaris (both x86 machines)
the "long" version:

s = (size-1) >> 4;
while (s > 0xFF) {
s = s >> 5;
bucket += 5;
}
while (s > 0x0F) {
s = s >> 4;
bucket += 4;
}
...

is faster than the "short" one above.
On Mac OSX it is the same (no difference).

Look the Sun Solaris 10 (x86 box):

(the "short" version)
Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 13753084 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)


(the "long" version)
-bash-3.00$ ./memtest

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 14341236 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

That is ((14341236-13753084)/14341236)*100 = 4%

On Linux we had about a 3% improvement, on Sun about 4%, and
on Mac OSX none. Note: all were x86 (Intel, AMD) machines,
just with different OSes and clock speeds.

When we go back to the "slow" (original) version:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 13474091 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

We get ((14341236-13474091)/14341236)*100 = 6% improvement.

Cheers
Zoran






Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Zoran Vasiljevic


On 16.01.2007 at 10:46, Gustaf Neumann wrote:

This is most probably the best variant so far, and not complicated,
so an optimizer can do "the right thing" easily. Sorry for the many
versions...

-gustaf


   { unsigned register int s = (size-1) >> 3;
 while (s>1) { s >>= 1; bucket++; }
   }

 if (bucket > NBUCKETS) {
   bucket = NBUCKETS;
 }


You'd be surprised that this one

s = (size-1) >> 4;
while (s > 0xFF) {
s = s >> 5;
bucket += 5;
}
while (s > 0x0F) {
s = s >> 4;
bucket += 4;
}
...

gives:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 10098495 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)


whereas this one:

 s = (size-1) >> 3;
 while (s>1) { s >>= 1; bucket++;}

gives:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 9720847 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

That is ((10098495-9720847)/10098495)*100 = 3.7% less.

That is all measured on Linux. I haven't done it on the Mac
and on the Sun yet. I now have all versions in place and will
play a little on each platform to see which one operates best
overall. The latest one is more appealing because of the
simplicity of the code, so we can turn a blind eye to those few
percent, I guess.

Cheers
Zoran




Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Gustaf Neumann

This is most probably the best variant so far, and not complicated, so an
optimizer can do "the right thing" easily. Sorry for the many versions...


-gustaf


   { unsigned register int s = (size-1) >> 3;
 while (s>1) { s >>= 1; bucket++; }
   }

 if (bucket > NBUCKETS) {
   bucket = NBUCKETS;
 }





Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Stephen Deasey

On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


On 15.01.2007 at 22:37, Zoran Vasiljevic wrote:

>
> On 15.01.2007 at 22:22, Mike wrote:
>
>>
>> Zoran, I believe you misunderstood.  The "patch" above limits blocks
>> allocated by your tester to 16000 instead of 16384 bytes.  The reason
>> for this is that Zippy's "largest bucket" is configured to be
>> 16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a typo).
>> By making uniformly random request sizes up to 16_3_84, you are
>> causing Zippy to fall back to system malloc for a small fraction of
>> requests, substantially penalizing its performance in these cases.
>
> Ah! That's right. I will fix that.
>
>>
>> You wanted to know why Zippy is slower on your test, this is the
>> reason.  This has a substantial impact on FreeBSD and Linux, and my
>> guess is that it will have a dramatic effect on Mac OSX.
>
> I will check that tomorrow on my machines.

YES. That did the trick. We have now demystified the behaviour
on the Mac. Indeed, when I limit the max alloc size to below *16284*
bytes, Zippy runs almost as fast as the VT alloc. So, I had overlooked
the fact that it was 16284 and not 16K (16384)!!
I wanted to give Zippy a fair chance but I missed that by about
100 bytes. Which made a huge difference. Still, it shows again one
of the weaknesses of Zippy: dependence on a (potentially suboptimal)
system memory allocator. But that is not to be blamed on Zippy, rather
on a weak system malloc, as on the Mac. I guess the same could have happened
to us with a slow mmap()/munmap()...

>
>>>
>>> How about adding this into the code?
>>
>> I think the most obvious replacement is just using an if "tree":
>> if (size>0xff) bucket+=8, size&=0xff;
>> if (size>0xf) bucket+=4, size&0xf;
>> ...
>> it takes a minute to get the math right, but the performance gain
>> should be substantial.
>
> Well, I can test that, alright. I have the feeling that a tight
> loop like that (it will mostly spin 5-12 times) gets compiled to good
> machine code, but it is better to test.

Alright. Gustaf came up with this, and it saves about 10% of the time:

#if 0
 while (bucket < NBUCKETS && globalCache.sizes[bucket].blocksize < size) {
     ++bucket;
 }
#else
 s = (size-1) >> 4;
 while (s > 0xFF) {
 s = s >> 5;
 bucket += 5;
 }
 while (s > 0x0F) {
 s = s >> 4;
 bucket += 4;
 }
 while (s > 0x08) {
 s = s >> 3;
 bucket += 3;
 }
 while (s > 0x04) {
 s = s >> 2;
 bucket += 2;
 }
 while (s > 0x00) {
 s = s >> 1;
 bucket++;
 }

I will leave the original loop in the code behind an ifdef,
as just by looking at the new code it is hard to understand what
is really happening. But it works, and it works fine.

Cheers
Zoran



Can you import this into CVS?  Top level.



Re: [naviserver-devel] Quest for malloc

2007-01-16 Thread Zoran Vasiljevic


On 15.01.2007 at 22:37, Zoran Vasiljevic wrote:



On 15.01.2007 at 22:22, Mike wrote:



Zoran, I believe you misunderstood.  The "patch" above limits blocks
allocated by your tester to 16000 instead of 16384 bytes.  The reason
for this is that Zippy's "largest bucket" is configured to be
16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a typo).
By making uniformly random request sizes up to 16_3_84, you are
causing Zippy to fall back to system malloc for a small fraction of
requests, substantially penalizing its performance in these cases.


Ah! That's right. I will fix that.



You wanted to know why Zippy is slower on your test, this is the
reason.  This has a substantial impact on FreeBSD and Linux, and my
guess is that it will have a dramatic effect on Mac OSX.


I will check that tomorrow on my machines.


YES. That did the trick. We have now demystified the behaviour
on the Mac. Indeed, when I limit the max alloc size to below *16284*
bytes, Zippy runs almost as fast as the VT alloc. So, I had overlooked
the fact that it was 16284 and not 16K (16384)!!
I wanted to give Zippy a fair chance but I missed that by about
100 bytes. Which made a huge difference. Still, it shows again one
of the weaknesses of Zippy: dependence on a (potentially suboptimal)
system memory allocator. But that is not to be blamed on Zippy, rather
on a weak system malloc, as on the Mac. I guess the same could have happened
to us with a slow mmap()/munmap()...





How about adding this into the code?


I think the most obvious replacement is just using an if "tree":
if (size>0xff) bucket+=8, size&=0xff;
if (size>0xf) bucket+=4, size&0xf;
...
it takes a minute to get the math right, but the performance gain
should be substantial.


Well, I can test that, alright. I have the feeling that a tight
loop like that (it will mostly spin 5-12 times) gets compiled to good
machine code, but it is better to test.


Alright. Gustaf came up with this, and it saves about 10% of the time:

#if 0
 while (bucket < NBUCKETS && globalCache.sizes[bucket].blocksize < size) {
 ++bucket;
 }
#else
s = (size-1) >> 4;
while (s > 0xFF) {
s = s >> 5;
bucket += 5;
}
while (s > 0x0F) {
s = s >> 4;
bucket += 4;
}
while (s > 0x08) {
s = s >> 3;
bucket += 3;
}
while (s > 0x04) {
s = s >> 2;
bucket += 2;
}
while (s > 0x00) {
s = s >> 1;
bucket++;
}

I will leave the original loop in the code behind an ifdef,
as just by looking at the new code it is hard to understand what
is really happening. But it works, and it works fine.

Cheers
Zoran








Re: [naviserver-devel] Quest for malloc

2007-01-15 Thread Zoran Vasiljevic


On 15.01.2007 at 20:15, Stephen Deasey wrote:



Nobody yet gave any reasonable explanation why we are
that fast on Mac OSX compared to any other allocator.
Recall, that was 870.573/70.713.324 ops/sec Zippy/VT
so about 81 times faster, for 16 threads.
Although it really seems like a bug either in the testcode
or in the allocator, I have not been able to verify any.
All is working as it should. So, the mystery remains...




Because Mac OSX SucksMonkeyBawlz() in a tight inner loop?


Actually, Mike was right. The test pattern maxed the size
at slightly above 16000, which pushed Zippy back to the system
allocator, and that alone screwed everything up. When I limit
the test program to allocate up to 16000 bytes but not more,
the performance of Zippy and VT is almost equal.

So, the only thing that remains is the memory handling.
But, as I stressed many times, our goal was to be within +/- 25%
of Zippy's performance with better memory handling (releasing
memory to the OS when possible). I still believe that we achieved
our goal very well.

But it is good to know why the difference on the Mac was so
much higher than elsewhere. I guess if I repeat the Zippy/VT
speed comparison on other platforms, with a 16000-byte upper limit,
the performance difference will be little or none.

Many thanks to Mike for the good observation!

Cheers
Zoran




Re: [naviserver-devel] Quest for malloc

2007-01-15 Thread Zoran Vasiljevic


On 15.01.2007 at 22:22, Mike wrote:



Zoran, I believe you misunderstood.  The "patch" above limits blocks
allocated by your tester to 16000 instead of 16384 bytes.  The reason
for this is that Zippy's "largest bucket" is configured to be
16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a typo).
By making uniformly random request sizes up to 16_3_84, you are
causing Zippy to fall back to system malloc for a small fraction of
requests, substantially penalizing its performance in these cases.


Ah! That's right. I will fix that.



You wanted to know why Zippy is slower on your test, this is the
reason.  This has a substantial impact on FreeBSD and Linux, and my
guess is that it will have a dramatic effect on Mac OSX.


I will check that tomorrow on my machines.


The benefit of mmap() is being able to "for sure" release memory back
to the system.  The drawback is that it always incurs a substantial
syscall overhead compared to malloc.  You decide which you prefer (I
think I would lean slightly toward mmap() for long lived applications,
but not by much, since the syscall introduces a lot of variance and an
average performance degradation).


Yep. I agree. I would avoid it if possible. But I know of no other
call that reliably returns memory! I see that most (all?) of the allocators
I know just keep everything allocated and never return it.



How about adding this into the code?


I think the most obvious replacement is just using an if "tree":
if (size>0xff) bucket+=8, size&=0xff;
if (size>0xf) bucket+=4, size&0xf;
...
it takes a minute to get the math right, but the performance gain
should be substantial.


Well, I can test that allright. I have the feeling that a tight
loop as that (will mostly sping 5-12 times) gets well compiled
in machine code, but it is better to test.



In my tests, due to the frequency of calls of these functions they
contribute 10% to 15% overhead in performance.


Yes. That is what I was also getting. OTOH, the speed difference
between VT and zippy was sometimes several orders of magnitude
so I simply ignored that.


Ha! It is pretty simple: you can atomically check pointer equivalence
without risking a core (at least this is my experience). You are not
expected to make far-reaching decisions based on it, though.
In this particular example, even if the test were false, there would be
no "harm" done; just a suboptimal path would be selected.
I have marked that "Dirty read" to draw people's attention to that
place.

And I succeeded, obviously :-)


The dirty read I have no problem with.  It's the possibility of
taking the head element, which could have been placed there by another
thread, that bothers me.


Ah, this will not happen, as I take the global mutex at that point,
so the pagePtr->p_cachePtr cannot be changed under our feet.
If that block was allocated by the current thread, the p_cachePtr
will not be changed by anybody. So no harm. If it is not, then we
must lock the global mutex to prevent anybody from fiddling with that
element. It is tricky, but it should work.
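A minimal sketch of the fast/slow free path being described here; Page, Cache, myCache and globalLock are hypothetical names, not VT's real structures:

#include <pthread.h>

typedef struct Block { struct Block *next; } Block;

typedef struct Cache {
    Block *freeList;                 /* per-thread free list, no lock needed */
} Cache;

typedef struct Page {
    Cache *cachePtr;                 /* cache (thread) that owns this page */
    Block *sharedList;               /* blocks freed by foreign threads */
} Page;

static pthread_mutex_t globalLock = PTHREAD_MUTEX_INITIALIZER;
static __thread Cache *myCache;      /* set when the thread's cache is created (gcc extension) */

static void FreeBlock(Page *pagePtr, Block *blockPtr)
{
    /* Dirty read: comparing the owner pointer without a lock; if the value
     * is stale, the comparison fails and we simply take the locked path. */
    if (pagePtr->cachePtr == myCache) {
        /* Fast path: the block came from one of our own pages. */
        blockPtr->next = myCache->freeList;
        myCache->freeList = blockPtr;
    } else {
        /* Slow path: hold the global mutex so pagePtr->cachePtr cannot
         * change under our feet while the block is handed back. */
        pthread_mutex_lock(&globalLock);
        blockPtr->next = pagePtr->sharedList;
        pagePtr->sharedList = blockPtr;
        pthread_mutex_unlock(&globalLock);
    }
}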


It sounds like you are in the best position to test this change to see
if it fixes the "unbounded" growth problem.



Yes! Indeed. The only thing I'd have to check is how much more
memory this will take. But it is certainly worth trying out,
as it will be a temporary relief for our users until we stress-test
VT to the max so I can include it in our standard distro.







Re: [naviserver-devel] Quest for malloc

2007-01-15 Thread Mike

> a)
> The test program Zoran includes biases Zippy toward "standard"
> allocator, which it does not do for VT.  The following patch
> "corrects" this behavior:
>
> +++ memtest.c   Sun Jan 14 16:43:23 2007
> @@ -211,6 +211,7 @@
>  } else {
>  size &= 0x3FFF; /* Limit to 16K */
>  }
> +   if (size>16000) size = 16000;
>  *toallocptr++ = size;
>  }
>  }
>


First of all, I wanted to give Zippy a fair chance. If I increase
the max allocation size, Zippy becomes even slower than it already is.
And, Zippy handles 16K pages, whereas we handle 32K pages.
Hence the

 size &= 0x3FFF; /* Limit to 16K */

which limits the allocation size to 16K max. Increasing that
would hit Zippy even harder than us.


Zoran, I believe you misunderstood.  The "patch" above limits blocks
allocated by your tester to 16000 instead of 16384 bytes.  The reason
for this is that Zippy's "largest bucket" is configured to be
16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a typo).
By making uniformly random request sizes up to 16_3_84, you are
causing Zippy to fall back to system malloc for a small fraction of
requests, substantially penalizing its performance in these cases.


> The following patch allows Zippy to be a lot less aggressive in
> putting blocks into the shared pool, bringing the performance of Zippy
> much closer to VT, at the expense of substantially higher memory
> "waste":
>
> @@ -128,12 +174,12 @@
>  {   64,  256, 128, NULL},
>  {  128,  128,  64, NULL},
>  {  256,   64,  32, NULL},
> -{  512,   32,  16, NULL},
> -{ 1024,   16,   8, NULL},
> -{ 2048,8,   4, NULL},
> -{ 4096,4,   2, NULL},
> -{ 8192,2,   1, NULL},
> -{16284,1,   1, NULL},
> +{  512,   64,  32, NULL},
> +{ 1024,   64,  32, NULL},
> +{ 2048,   64,  32, NULL},
> +{ 4096,   64,  32, NULL},
> +{ 8192,   64,  32, NULL},
> +{16284,   64,  32, NULL},
>

I cannot comment on that. Possibly you are right, but I do not
see much benefit in that except speeding up Zippy to be on par
with VT, whereas the most important VT feature is not the speed,
it is the memory handling.


You wanted to know why Zippy is slower on your test, this is the
reason.  This has a substantial impact on FreeBSD and Linux, and my
guess is that it will have a dramatic effect on Mac OSX.


> VT releases the memory held in a thread's
> local pool when a thread terminates.  Since it uses mmap by default,
> this means that de-allocated storage is actually released to the
> operating system, forcing new threads to call mmap() again to get
> memory, thereby incurring system call overhead that could be avoided
> in some cases if the system malloc implementation did not lower the
> sbrk point at each deallocation. Using malloc() in VT allocator
> should give it much more uniform and consistent performance.

Not necessarily.  We'd shoot ourselves in the foot by doing so,
because most OS allocators never return memory to the system, and
one of our major benefits would be gone.
What we could do: timestamp each page, return all pages to the
global cache, and prune the older ones. Or, put a size constraint on the
global cache. But then you'd have yet another knob to adjust,
and the difficulty would be finding the right setup. VT is
simpler in that it does not offer you ANY knobs you can trim
(for better or for worse). In some early stages of the design
we had a number of knobs and were not certain how to adjust them.
So we threw that away and redesigned all parts to be "self adjusting"
if possible.
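A rough sketch of the "timestamp and prune" idea mentioned above; Page, globalPages and ReleasePage are hypothetical names, and the caller is assumed to hold the lock protecting the global cache:

#include <time.h>
#include <stddef.h>

typedef struct Page {
    struct Page *next;
    time_t freedAt;                  /* when the page entered the global cache */
} Page;

static Page *globalPages;            /* global cache list, protected by the caller */

/* Give back pages that have sat unused in the global cache too long. */
static void PruneGlobalCache(int maxAgeSec, void (*ReleasePage)(Page *))
{
    time_t now = time(NULL);
    Page **link = &globalPages;

    while (*link != NULL) {
        Page *pagePtr = *link;

        if (difftime(now, pagePtr->freedAt) > maxAgeSec) {
            *link = pagePtr->next;   /* unlink and return the page to the OS */
            ReleasePage(pagePtr);
        } else {
            link = &pagePtr->next;
        }
    }
}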


The benefit of mmap() is being able to "for sure" release memory back
to the system.  The drawback is that it always incurs a substantial
syscall overhead compared to malloc.  You decide which you prefer (I
think I would lean slightly toward mmap() for long lived applications,
but not by much, since the syscall introduces a lot of variance and an
average performance degradation).
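For illustration, a minimal sketch of the mmap() side of this trade-off; the 32K chunk size and the helper names are assumptions, not VT's actual code. Each pool chunk is its own mapping, so releasing it really returns the pages to the OS, at the price of a syscall per chunk:

#include <sys/mman.h>
#include <stddef.h>

#ifndef MAP_ANON
#define MAP_ANON MAP_ANONYMOUS       /* some platforms use the other spelling */
#endif

#define POOL_CHUNK (32 * 1024)       /* hypothetical 32K pool page */

static void *AllocChunk(void)
{
    void *p = mmap(NULL, POOL_CHUNK, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);

    return (p == MAP_FAILED) ? NULL : p;   /* one syscall per chunk */
}

static void ReleaseChunk(void *p)
{
    munmap(p, POOL_CHUNK);           /* memory goes straight back to the OS */
}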


> e)
> Both allocators use an O(n) algorithm to compute the power of two
> "bucket" for the allocated size.  This is just plain silly since an
> O(log n) algorithm will offer a non-negligible speed-up in both
> allocators.  This is the current O(n) code:
>  while (bucket < NBUCKETS && globalCache.sizes
> [bucket].blocksize < size) {
> ++bucket;
> }

How about adding this into the code?


I think the most obvious replacement is just using an if "tree":
if (size>0xff) bucket+=8, size&=0xff;
if (size>0xf) bucket+=4, size&0xf;
...
it takes a minute to get the math right, but the performance gain
should be substantial.
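One way to "get the math right" for such an if tree, sketched with a hypothetical Log2Tree() helper rather than Mike's exact code: it computes floor(log2(s)) for 1 <= s < 65536 with four comparisons, and a bucket index could then be derived from it (e.g. Log2Tree((size - 1) >> 3), capped at NBUCKETS), just like the shift-loop variants elsewhere in this thread:

static int Log2Tree(unsigned int s)
{
    int log = 0;

    if (s > 0xFF) { log += 8; s >>= 8; }
    if (s > 0x0F) { log += 4; s >>= 4; }
    if (s > 0x03) { log += 2; s >>= 2; }
    if (s > 0x01) { log += 1; }

    return log;
}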


> f)
> Zippy uses Ptr2Block and Block2Ptr functions where as VT uses macros
> for this.  Zippy also does more checks on MAGIC numbers on each
> allocation which VT only performs on de-allocation.  I am not sure if
> current compilers are smart enough to inline the functions in Zippy, I
> did not test this.  When compiled with 

Re: [naviserver-devel] Quest for malloc

2007-01-15 Thread Stephen Deasey

On 1/15/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


Nobody yet gave any reasonable explanation why we are
that fast on Mac OSX compared to any other allocator.
Recall, that was 870.573/70.713.324 ops/sec Zippy/VT
so about 81 times faster, for 16 threads.
Although it really seems like a bug either in the testcode
or in the allocator, I have not been able to verify any.
All is working as it should. So, the mystery remains...




Because Mac OSX SucksMonkeyBawlz() in a tight inner loop?

The engineers at Apple have many fine achievements, but this kind of
system level performance isn't one of them.  All the benchmarks I've
ever seen show that SucksMonkeyBawlz() is sprinkled throughout the
code responsible for locking, context switching, memory allocation,
etc.

So, don't be surprised.  Enjoy the drop shadows!



Re: [naviserver-devel] Quest for malloc

2007-01-15 Thread Vlad Seryakov
I've been running the new allocator for several weeks now on a busy
NaviServer; memory does not grow anymore, once threads exit it is returned,
and no crashes have been observed.


Zoran Vasiljevic wrote:

On 19.12.2006 at 20:42, Stephen Deasey wrote:


Zoran will be happy...  :-)


Zoran is again happy to put the next small update
of the (famous) VT malloc on:

 http://www.archiware.com/downloads/vtmalloc-0.0.2.tar.gz

For the list of changes since 0.0.1, please look in the
ChangeLog file.

As it seems, we are still pretty fast and, thanks to Mike,
we know why Zippy is that slow when exposed to our memtest
program.

Nobody yet gave any reasonable explanation why we are
that fast on Mac OSX compared to any other allocator.
Recall, that was 870.573/70.713.324 ops/sec Zippy/VT
so about 81 times faster, for 16 threads.
Although it really seems like a bug either in the testcode
or in the allocator, I have not been able to verify any.
All is working as it should. So, the mystery remains...

Cheers
Zoran








--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2007-01-15 Thread Zoran Vasiljevic


On 19.12.2006 at 20:42, Stephen Deasey wrote:



Zoran will be happy...  :-)


Zoran is again happy to put the next small update
of the (famous) VT malloc on:

http://www.archiware.com/downloads/vtmalloc-0.0.2.tar.gz

For the list of changes since 0.0.1, please look in the
ChangeLog file.

As it seems, we are still pretty fast and, thanks to Mike,
we know why Zippy is that slow when exposed to our memtest
program.

Nobody yet gave any reasonable explanation why we are
that fast on Mac OSX compared to any other allocator.
Recall, that was 870.573/70.713.324 ops/sec Zippy/VT
so about 81 times faster, for 16 threads.
Although it really seems like a bug either in the testcode
or in the allocator, I have not been able to verify any.
All is working as it should. So, the mystery remains...

Cheers
Zoran







Re: [naviserver-devel] Quest for malloc

2007-01-15 Thread Zoran Vasiljevic


On 15.01.2007 at 10:27, Mike wrote:



Although not entirely sure why, I have spent some time analyzing the
behavior and code of both of these allocators.


Well, I'd say the "why" is simple: the results I have presented
are just too "tempting", so you really wanted to know *why*.
This is normal. I'd do the same.



a)
The test program Zoran includes biases Zippy toward "standard"
allocator, which it does not do for VT.  The following patch
"corrects" this behavior:

+++ memtest.c   Sun Jan 14 16:43:23 2007
@@ -211,6 +211,7 @@
 } else {
 size &= 0x3FFF; /* Limit to 16K */
 }
+   if (size>16000) size = 16000;

 *toallocptr++ = size;
 }
 }




First of all, I wanted to give Zippy a fair chance. If I increase
the max allocation size, Zippy becomes even slower than it already is.
And, Zippy handles 16K pages, whereas we handle 32K pages.
Hence the

size &= 0x3FFF; /* Limit to 16K */

which limits the allocation size to 16K max. Increasing that
would hit Zippy even harder than us.


b)
The key difference between Zippy and VT allocators arises from their
use of the shared "freed" memory pool.  Zippy calls this the shared
cache, VT calls this the global cache.  Zippy's goal appears to have
been to minimize memory usage (while the stated goal is to reduce lock
contention).  Zippy does this by aggressively moving freed blocks to
the shared cache, allowing any thread to later allocate memory from
this shared pool.  Meanwhile VT targets speed, trading off bloat, and
allowing freed blocks to return to the private per-thread pools.  To
allow for this speed optimization, VT keeps a pointer to the cache
that allocated it within each "page", something that can be done for
Zippy if speed was the goal.


Hmhm... Still, our intention is to be more conservative in *overall*
memory usage. That means I'm prepared to give myself more memory
if I can be faster with it *temporarily* (after all, modern systems
have huge memory banks), but I would not like to be greedy and keep
that memory for myself all the time. Which is precisely what VT does:
it is more memory hungry in terms of temporarily allocated memory
(although not so significantly that this becomes a problem), but it is
social enough to release it when it is not needed any more.



c)
The key reason why Zippy substantially lags behind VT in performance
is actually because Zippy beats itself at its own game.  While its
stated goal is to minimize lock contention, the hardcoded constants
used in Zippy actually completely sacrifice lock contention for
storage.  Naturally, thread-local pools can be used to allocate blocks
immediately, while the shared pool must be locked by a mutex when
allocation is performed.  The current Zippy configuration minimizes
the amount of storage "wasted" in per-thread pools by aggressively
moving larger blocks to the shared cache.  The more threads attempt to
allocate/free large blocks, the worse the contention and the lower the
performance.

Zoran's test program produces allocation sizes that are uniform
random, so large blocks are as likely as small blocks; therefore
performance suffers substantially.  A more accurate benchmark would
take common usage patterns from Tcl/NaviServer, which I suspect are
heavily biased toward allocation of small objects.


If you can modify memtest.c to be like that, I'd have nothing against it!
Actually, we have no problems with small allocations nor with large ones,
as they are all handled by the same mechanism. In Zippy, large allocations
(over 16K) are just handled with the system malloc, with all the trade-offs
that this brings.



The following patch allows Zippy to be a lot less aggressive in
putting blocks into the shared pool, bringing the performance of Zippy
much closer to VT, at the expense of substantially higher memory
"waste":

@@ -128,12 +174,12 @@
 {   64,  256, 128, NULL},
 {  128,  128,  64, NULL},
 {  256,   64,  32, NULL},
-{  512,   32,  16, NULL},
-{ 1024,   16,   8, NULL},
-{ 2048,8,   4, NULL},
-{ 4096,4,   2, NULL},
-{ 8192,2,   1, NULL},
-{16284,1,   1, NULL},
+{  512,   64,  32, NULL},
+{ 1024,   64,  32, NULL},
+{ 2048,   64,  32, NULL},
+{ 4096,   64,  32, NULL},
+{ 8192,   64,  32, NULL},
+{16284,   64,  32, NULL},



I cannot comment on that. Possibly you are right, but I do not
see much benefit in that except speeding up Zippy to be on par
with VT, whereas the most important VT feature is not the speed,
it is the memory handling.


d)
VT uses mmap by default to allocate memory, Zippy uses the system
malloc.  By doing this, VT actually penalizes itself in an environment
where lots of small blocks are frequently allocated and threads are
often created/destroyed.


Partly right. Lots of small blocks are no problem. We allocate 32K
pages, which yield 2048 16-byte blocks, 1024 32-byte blocks, etc.
So, small allocations are

Re: [naviserver-devel] Quest for malloc

2007-01-15 Thread Mike

Vlad has written an allocator that uses mmap to obtain
memory from the system and munmaps that memory on thread
exit, if possible.

I have spent more than 3 weeks fiddling with that and
discussing it with Vlad, and this is what we both came up with:

http://www.archiware.com/downloads/vtmalloc-0.0.1.tar.gz

I believe we have solved most of my needs. Below is an excerpt
from the README file for the curious.

Would anybody care to test it in his/her own environment?
If all goes well, I might TIP this to be included in the Tcl core
as a replacement for (or addition to) the zippy allocator.


Although not entirely sure why, I have spent some time analyzing the
behavior and code of both of these allocators.  Since I don't really
want to spend too much more time, the following comments are not organized
in any particular order of importance or relevance...

a)
The test program Zoran includes biases Zippy toward "standard"
allocator, which it does not do for VT.  The following patch
"corrects" this behavior:

+++ memtest.c   Sun Jan 14 16:43:23 2007
@@ -211,6 +211,7 @@
} else {
size &= 0x3FFF; /* Limit to 16K */
}
+   if (size>16000) size = 16000;
*toallocptr++ = size;
}
}

b)
The key difference between Zippy and VT allocators arises from their
use of the shared "freed" memory pool.  Zippy calls this the shared
cache, VT calls this the global cache.  Zippy's goal appears to have
been to minimize memory usage (while the stated goal is to reduce lock
contention).  Zippy does this by aggressively moving freed blocks to
the shared cache, allowing any thread to later allocate memory from
this shared pool.  Meanwhile VT targets speed, trading off bloat, and
allowing freed blocks to return to the private per-thread pools.  To
allow for this speed optimization, VT keeps a pointer to the cache
that allocated it within each "page", something that can be done for
Zippy if speed was the goal.
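One common way to make the owning cache reachable from a block pointer, sketched here with hypothetical names and an assumed power-of-two page size (an illustration of the idea, not VT's actual layout): if pool pages are allocated PAGE_CHUNK-aligned, the page header, and with it the owner, can be recovered from any block pointer by masking:

#include <stdint.h>

#define PAGE_CHUNK (32u * 1024u)     /* assumed 32K, power-of-two aligned */

typedef struct PageHeader {
    struct Cache *cachePtr;          /* thread cache that owns this page */
    unsigned int  blockSize;         /* size class served by this page */
} PageHeader;

static PageHeader *Block2Page(void *blockPtr)
{
    /* Works only because pages are aligned to PAGE_CHUNK. */
    return (PageHeader *) ((uintptr_t) blockPtr & ~((uintptr_t) PAGE_CHUNK - 1));
}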

c)
The key reason why Zippy substantially lags behind VT in performance
is actually because Zippy beats itself at its own game.  While its
stated goal is to minimize lock contention, the hardcoded constants
used in Zippy actually completely sacrifice lock contention for
storage.  Naturally, thread-local pools can be used to allocate blocks
immediately, while the shared pool must be locked by a mutex when
allocation is performed.  The current Zippy configuration minimizes
the amount of storage "wasted" in per-thread pools by aggressively
moving larger blocks to the shared cache.  The more threads attempt to
allocate/free large blocks, the worse the contention and the lower the
performance.

Zoran's test program produces allocation sizes that are uniform
random, so large blocks are as likely as small blocks; therefore
performance suffers substantially.  A more accurate benchmark would
take common usage patterns from Tcl/NaviServer, which I suspect are
heavily biased toward allocation of small objects.
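As a sketch of what a small-object-biased size generator could look like, instead of uniform random sizes (the distribution below is an arbitrary illustration, not a measured Tcl/NaviServer profile):

#include <stdlib.h>

/* Pick an exponent uniformly, then a size within that power-of-two band,
 * so roughly half of all requests come out below 512 bytes. */
static size_t SmallBiasedSize(void)
{
    int exp = 4 + rand() % 11;                      /* bands 2^4 .. 2^14 */
    size_t top = (size_t) 1 << exp;

    return top / 2 + (size_t) rand() % (top / 2);   /* in [2^(exp-1), 2^exp) */
}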

The following patch allows Zippy to be a lot less aggressive in
putting blocks into the shared pool, bringing the performance of Zippy
much closer to VT, at the expense of substantially higher memory
"waste":

@@ -128,12 +174,12 @@
{   64,  256, 128, NULL},
{  128,  128,  64, NULL},
{  256,   64,  32, NULL},
-{  512,   32,  16, NULL},
-{ 1024,   16,   8, NULL},
-{ 2048,8,   4, NULL},
-{ 4096,4,   2, NULL},
-{ 8192,2,   1, NULL},
-{16284,1,   1, NULL},
+{  512,   64,  32, NULL},
+{ 1024,   64,  32, NULL},
+{ 2048,   64,  32, NULL},
+{ 4096,   64,  32, NULL},
+{ 8192,   64,  32, NULL},
+{16284,   64,  32, NULL},

d)
VT uses mmap by default to allocate memory, Zippy uses the system
malloc.  By doing this, VT actually penalizes itself in an environment
where lots of small blocks are frequently allocated and threads are
often created/destroyed. VT releases the memory held in a thread's
local pool when a thread terminates.  Since it uses mmap by default,
this means that de-allocated storage is actually released to the
operating system, forcing new threads to call mmap() again to get
memory, thereby incurring system call overhead that could be avoided
in some cases if the system malloc implementation did not lower the
sbrk point at each deallocation.  Using malloc() in the VT allocator
should give it much more uniform and consistent performance.  Using
mmap() in Zippy has less performance impact since memory is never
released by Zippy (at thread termination it is just placed back into
the shared pool).

Another obvious downside of using mmap() for Zippy is that realloc()
must always fall back to the slow allocate/copy/free mechanism and can
never be optimized.

e)
Both allocators use an O(n) algorithm to compute the power of two
"bucket" for the allocated size.  This is just plain silly since an
O(log n) algorithm will offer a non-negligible speed-up in both
allocators.  This is 

Re: [naviserver-devel] Quest for malloc

2007-01-13 Thread Zoran Vasiljevic


On 13.01.2007 at 10:45, Gustaf Neumann wrote:


The fault was that I did not read the README (I read the first one) and
compiled (a) without -DTCL_THREADS.


In that case, the fault was that on FreeBSD you need to
explicitly put "-pthread" when linking the test program,
regardless of the fact that libtcl8.4.so was already
linked with it. That alone did the trick.

Speed was (as expected, and it is still not clear why)
at least 2 times better than anything else. In some
rough cases it was _significantly_ faster.

But... I believe we should not fixate on the
speed of the allocator. It was not our intention to
make something faster. Our intention was to release
memory early enough so we don't bloat the system as
a long-running process.

I admit, speed of the code is always the most interesting
and tempting issue for engineers, but in this case it
was really the memory savings for long-running programs
that we were after.

Having said that, I must again repeat that we'd like to
get some field experience with the allocator before we
take any further steps. This means that we are thankful
for any feedback.

Cheers,
zoran







Re: [naviserver-devel] Quest for malloc

2007-01-13 Thread Mike

On 1/13/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


On 13.01.2007 at 06:17, Mike wrote:

>  I'm happy to offer ssh access to a test
> box where you can reproduce these results.

Oh, that is very fine! Can you give me the
access data? You can post me the login-details
in a separate private mail.


Zoran,
Tried to contact you, but did not receive a reply.  Check your spam
filter or email me.



Re: [naviserver-devel] Quest for malloc

2007-01-13 Thread Zoran Vasiljevic


On 13.01.2007 at 10:45, Gustaf Neumann wrote:


PPS: strangely, the only thing making me suspicious is the
huge amount of improvement, especially on Mac OS X.


Look...
Running the test program unmodified (on Mac Pro box):

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 35096360 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

If I modify the memtest.c program at line 146 to read:

    if (dorealloc && (allocptr > tdata[tid].allocs) && (r & 1)) {
        allocptr[-1] = reallocs[whichmalloc](allocptr[-1], *toallocptr);
    } else {
        allocptr[0] = mallocs[whichmalloc](*toallocptr);
/*-->*/ memset(allocptr[0], 0, *toallocptr > 64 ? 64 : *toallocptr);
        allocptr++;
    }

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 28377808 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

If I memset the whole memory area, not just the first 64 bytes:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 14862477 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)


BUT, guess what! The system allocator gives me (using the same test data,
i.e. memsetting the whole allocated chunk):

Test standard allocator with 4 threads, 16000 records ...
This allocator achieves 869716 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)


So we are still 14862477/869716 = 17 times faster. With increasing
thread count we get faster and faster, whereas the system allocator stays
at the same (low) level or even gets slower.

Now, I would really like to know why! Perhaps it is the fact that we are
using mmap() instead of god-knows-what Apple is using...

Anyways... either we have some very big error in there (in which
case I'd like to know where, as everything is working as it should!)
or we have found a much better way to handle memory on Mac OSX :-)

Cheers
Zoran







Re: [naviserver-devel] Quest for malloc

2007-01-13 Thread Zoran Vasiljevic


On 13.01.2007 at 10:45, Gustaf Neumann wrote:


After correcting these configuration issues, the program works VERY well.
I tried it on 32-bit and 64-bit machines (a minor complaint about the memtest
program: it casts a 32-bit int to ClientData and vice versa).


Where exactly so I can fix that?



-gustaf

PS: I could get access to a 64-bit AMD FreeBSD machine
on Monday, if there is still a need...


Well, I could use some hands-on experience...
Please send me some login data so I can check it out!



PPS: strangely, the only thing making me suspicious is the
huge amount of improvement, especially on Mac OS X.
I can't remember in my experience having seen such a
drastic performance increase from a relatively small code
change, especially in an area which is usually carefully
fine-tuned, and on which many CS grads from all over the world
write their theses.


This *is* the only fact that is *really* puzzling me so much.
I cannot explain it at all. I was also stepping through the whole
thing with the debugger because I thought there must be
some error somewhere, but I found none! Then I thought it
was the test program that does something weird. But it's not!
The test program just happily allocates random chunks between
16 and 16384 bytes and then releases them. No big magic.
And certainly not rocket science. So the mystery
remains... Moreover, the Tcl allocator seems to suck greatly
when exposed to such a test on Mac OSX. Also not very explainable.
The only thing I noticed is: the Tcl allocator uses MUCH system
time, whereas our alloc uses close to none. That would suggest
that it does lots of locking (I cannot imagine it would do
anything else system-related), whereas our alloc does close to
no locking (that is, when the memory is allocated and freed
in the same thread, which is what happens 99% of the time).



I would recommend that Vlad
and Zoran should write a technical paper about the new
allocator and analyze the properties and differences


If I could ever get the time for that! What we could write is a
more in-depth explanation of how it works (it is actually very
simple).

Cheers
Zoran





Re: [naviserver-devel] Quest for malloc

2007-01-13 Thread Gustaf Neumann



I downloaded the code in the previous mail.  After some minor path
adjustments, I was able to get the test program to compile and link
under FreeBSD 6.1 running on a dual-processor PIII system, linked
against a threaded tcl 8.5a.  I could get this program to consistently
do one of two things:
- dump core
- hang seemingly forever
but absolutely nothing else.
  

Mike,

when Zoran announced the version, I downloaded it and had similar
experiences.
Fault 1 turned out to be: Zoran's link led to a premature version
of the software, not the real thing (the right version is untarred to a
directory containing the version numbers).

Then Zoran corrected the link, I refetched, and .. well, no makefile.
Just compile and try: same effect.
The fault was that I did not read the README (I read the first one) and
compiled (a) without -DTCL_THREADS.

I had exactly the same symptoms.

After correcting these configuration issues, the program works VERY well.
I tried it on 32-bit and 64-bit machines (a minor complaint about the memtest
program: it casts a 32-bit int to ClientData and vice versa).

-gustaf

PS: I could get access to a 64-bit AMD FreeBSD machine
on Monday, if there is still a need...

PPS: strangely, the only thing making me suspicious is the
huge amount of improvement, especially on Mac OS X.
I can't remember in my experience having seen such a
drastic performance increase from a relatively small code
change, especially in an area which is usually carefully
fine-tuned, and on which many CS grads from all over the world
write their theses. I would recommend that Vlad
and Zoran write a technical paper about the new
allocator and analyze its properties and differences.




Re: [naviserver-devel] Quest for malloc

2007-01-13 Thread Zoran Vasiljevic


On 13.01.2007 at 06:17, Mike wrote:


Running this program under the latest version of valgrind (using
memcheck or helgrind tools) reveals numerous errors from valgrind,
which I suspect (although I did not confirm) are the reason for the
core dumps and infinite hangs when it is run on its own.


Even more interesting... I just gave it a Purify run
on Solaris 2.8 with 4 threads and it revealed absolutely
no problems nor leaks. Heh?

Can it be that the problem is not the alloc code but the
Tcl 8.5 alpha that you linked against? I never tested
anything other than 8.4.14. Please be aware that I
haven't touched the 8.5 tree up to now, so there could
be some problems there, as there have been lots of
changes in the Tcl head branch lately.

To save your time and mine, access to your box
where I can verify the odd behaviour that you're
reporting would be very helpful!

Cheers
Zoran 





Re: [naviserver-devel] Quest for malloc

2007-01-13 Thread Zoran Vasiljevic


On 13.01.2007 at 06:17, Mike wrote:


 I'm happy to offer ssh access to a test
box where you can reproduce these results.


Oh, that is very fine! Can you give me the
access data? You can post me the login-details
in a separate private mail.

Thanks,
Zoran




Re: [naviserver-devel] Quest for malloc

2007-01-13 Thread Zoran Vasiljevic


On 13.01.2007 at 06:17, Mike wrote:


I downloaded the code in the previous mail.  After some minor path
adjustments, I was able to get the test program to compile and link
under FreeBSD 6.1 running on a dual-processor PIII system, linked
against a threaded tcl 8.5a.  I could get this program to consistently
do one of two things:
- dump core
- hang seemingly forever
but absolutely nothing else.
Running this program under the latest version of valgrind (using
memcheck or helgrind tools) reveals numerous errors from valgrind,
which I suspect (although I did not confirm) are the reason for the
core dumps and infinite hangs when it is run on its own.


Hey, it is the first time *ever* it has gone out to the public,
so do not expect mission-critical, bullet-proof code!
No wonder there are still errors in there, but those are
to be fixed, of course. After all, at least two persons
(myself and Vlad) are going to include this work in
production system(s). So it needs a lot of testing, of course.
Thank you for taking a look at it.

If you'd like to help a bit... compile the Tcl with
--enable-symbols and hit it again until it crashes.
Then inspect the core with the debugger and give me
the stack trace of the crashing thread.

And, generally speaking...

I would not spend time on that if it were avoidable.
Show me a good, memory-conservative allocator that is fast
enough, returns memory to the system, and works across
Linux, Solaris, Mac OSX and Windows? To my knowledge,
there is none. During all this (testing and developing) time,
I found the Solaris alloc to be the most appropriate,
but still, this one also grabs all the memory it can and
never releases it back! So, the point is not that I'd
like some exercise in writing memory allocators. I don't.
I have *plenty* of other work on my plate. But we happen
to have a product out there (already 1000+ installations
worldwide) that needs a reboot each day because of the way
it consumes system memory. Not leaks. Regular consumption.
So I have a very pressing need to undertake something in
this direction, if you understand what I mean.

Now if you can get me some debug data from your box
so I can check what is going on, that would be very nice!

Cheers
Zoran





Re: [naviserver-devel] Quest for malloc

2007-01-12 Thread Mike

I've been on a search for an allocator that will be fast
enough and not as memory hungry as the allocator built
into Tcl. Unfortunately, as is mostly the case, it turned
out that I had to write my own.

Vlad has written an allocator that uses mmap to obtain
memory from the system and munmap that memory on thread
exit, if possible.

I have spent more than 3 weeks fiddling with that and
discussing it with Vlad, and this is what we both came to:

http://www.archiware.com/downloads/vtmalloc-0.0.1.tar.gz

I believe we have solved most of my needs. Below is an excerpt
from the README file for the curious.

Would anybody care to test it in his/her own environment?
If all goes well, I might TIP this to be included in the Tcl core
as a replacement of (or addition to) the zippy allocator.


Zoran,

Because I am quite biased here, I want to explicitly state my bias up
front, to avoid later being branded as such: In my experience,
very little good comes out of people writing their own memory
allocators.  There is a small number of people in this world for whom
this privilege should be reserved (outside of a classroom exercise,
of course), and the rest of us humble folk should help them when we
can but generally stay out of the way - setting out to reinvent the
wheel is not a good thing.

I downloaded the code in the previous mail.  After some minor path
adjustments, I was able to get the test program to compile and link
under FreeBSD 6.1 running on a dual-processor PIII system, linked
against a threaded tcl 8.5a.  I could get this program to consistently
do one of two things:
- dump core
- hang seemingly forever
but absolutely nothing else.
Running this program under the latest version of valgrind (using
memcheck or helgrind tools) reveals numerous errors from valgrind,
which I suspect (although I did not confirm) are the reason for the
core dumps and infinite hangs when it is run on its own.

I have no time to debug this myself,  however in the interest of
science and general progress, I'm happy to offer ssh access to a test
box where you can reproduce these results.  I strongly advise against
using a benchmark with the above characteristics to make any decisions
about speed or memory consumption improvements or problems.

---

After toying around with this briefly, I was able to run the test
program under valgrind after specifying a -rec value of 1000 or less.
Despite some errors reported by valgrind, the test program does run to
completion and report its results in these cases.

standard allocator:
This allocator achieves 43982 ops/sec under 4 threads
tcl allocator:
This allocator achieves 21251 ops/sec under 4 threads
improved tcl allocator:
This allocator achieves 21308 ops/sec under 4 threads

But again, I would not draw any serious conclusions from these numbers.



Re: [naviserver-devel] Quest for malloc

2007-01-12 Thread Zoran Vasiljevic


Am 19.12.2006 um 20:42 schrieb Stephen Deasey:


On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:


Right, with Ns_ functions it does not crash.



Zoran will be happy...  :-)


In fact, yes! I'm more than happy to announce something
that will change the way we use computers in 22 century
(if we live enough to witness it) :-)

Seriously...

I've been on a search for an allocator that will be fast
enough and not as memory hungry as the allocator built
into Tcl. Unfortunately, as is mostly the case, it turned
out that I had to write my own.

Vlad has written an allocator that uses mmap to obtain
memory from the system and munmap that memory on thread
exit, if possible.
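For readers who don't want to dig into the tarball, the lifecycle being
described can be sketched roughly like this. This is an illustration of
the idea only, not the actual vtmalloc code; ARENA_SIZE, ArenaGet and
ArenaFree are invented names:

#include <pthread.h>
#include <sys/mman.h>

/* Illustrative only: one mmap'ed region per thread, handed back to the
 * OS by a pthread key destructor when the thread exits. */

#define ARENA_SIZE (1024 * 1024)

static pthread_key_t  arenaKey;
static pthread_once_t arenaOnce = PTHREAD_ONCE_INIT;

static void
ArenaFree(void *arena)              /* destructor: runs at thread exit */
{
    munmap(arena, ARENA_SIZE);      /* return the whole region to the OS */
}

static void
ArenaInit(void)
{
    pthread_key_create(&arenaKey, ArenaFree);
}

static void *
ArenaGet(void)                      /* lazily map a per-thread region */
{
    void *arena;

    pthread_once(&arenaOnce, ArenaInit);
    arena = pthread_getspecific(arenaKey);
    if (arena == NULL) {
        /* MAP_ANONYMOUS may be spelled MAP_ANON on some systems. */
        arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) {
            return NULL;
        }
        pthread_setspecific(arenaKey, arena);
    }
    return arena;
}

A real allocator would of course carve individual blocks out of such a
region and handle growth; the point here is only the mmap-at-need,
munmap-at-thread-exit lifecycle.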

I have spent more than 3 weeks fiddling with that and
discussing it with Vlad, and this is what we both came to:

   http://www.archiware.com/downloads/vtmalloc-0.0.1.tar.gz

I believe we have solved most of my needs. Below is an excerpt
from the README file for the curious.

Would anybody care to test it in his/her own environment?
If all goes well, I might TIP this to be included in the Tcl core
as a replacement of (or addition to) the zippy allocator.

-

Compared was the performance of the OS memory allocator (Standard),
the Tcl built-in threading allocator (Zippy) and this (VT) allocator.

The first table shows alloc/free operations on 16000 blocks of memory,
each of random size between 16 and 16384 bytes. The total number of
blocks is divided among threads, so 1 thread operates on 16000 blocks,
2 threads each on 8000, 4 threads each on 4000 blocks, etc.
For each test, the program was run three times and the best value was taken.
Speed numbers are in operations/second. More is better.

The second table shows memory usage. Values are gathered by peeking
at the system "top" utility.
The "Top" is peak memory during the program run.
The "Low" is just before the program exits.
Memory usage numbers are (rounded) in MB. Less is better.
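To make the methodology concrete, the per-thread workload can be pictured
roughly as follows. This is a reconstruction from the description above,
not the actual test source; it runs the thread shares sequentially just to
show the shape of the measured loop, and all names are made up:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MINSIZE    16
#define MAXSIZE 16384

/* One thread's share of the work: allocate and later free `nblocks`
 * randomly sized blocks; each malloc and each free counts as one op. */
static long
BenchShare(int nblocks)
{
    void **ptrs = malloc(nblocks * sizeof(void *));
    long   ops  = 0;
    int    i;

    for (i = 0; i < nblocks; i++) {
        size_t size = MINSIZE + (size_t)(rand() % (MAXSIZE - MINSIZE + 1));
        ptrs[i] = malloc(size);
        ops++;
    }
    for (i = 0; i < nblocks; i++) {
        free(ptrs[i]);
        ops++;
    }
    free(ptrs);
    return ops;
}

int
main(void)
{
    int     nthreads = 4;                 /* 16000 blocks split over threads */
    int     nblocks  = 16000 / nthreads;
    clock_t start    = clock();
    long    ops      = 0;
    double  elapsed;
    int     t;

    for (t = 0; t < nthreads; t++) {      /* sequential stand-in for threads */
        ops += BenchShare(nblocks);
    }
    elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (elapsed <= 0.0) {
        elapsed = 1e-6;                   /* guard against a zero clock delta */
    }
    printf("%.0f ops/sec\n", ops / elapsed);
    return 0;
}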


Machine: Apple Mac Pro, 2 x Intel Core Duo 2.66GHz, 1GB, Mac OSX 10.4.8


| Allocator| 1 threads | 2 threads | 4 threads | 8 threads |16 threads |
+==+===+===+===+===+===+
| Standard | 2.316.454 | 2.187.852 | 2.103.777 | 2.108.825 | 2.304.939 |
+--+---+---+---+---+---+
| Zippy| 7.111.380 | 3.214.132 | 1.450.300 |   851.347 |   870.573 |
+--+---+---+---+---+---+
| VT   |25.047.968 |25.438.877 |30.615.718 |48.845.898 |70.713.324 |

=
|   |   Top   |   Low   |
| Allocator |  Resident  |  Virtual   |  Resident  |   Virtual  |
+---+++++
| Standard  | 49 |125 | 49 |112 |
+---+++++
| Zippy |102 |182 |102 |182 |
+---+++++
| VT| 43 |169 |  1 | 50 |
=



Machine: Sun Ultra 20, 1 x AMD 2.6GHz, 2GB, Solaris 10


| Allocator| 1 threads | 2 threads | 4 threads | 8 threads |16 threads |
+==+===+===+===+===+===+
| Standard | 7.725.757 | 7.940.706 | 8.661.384 | 9.673.767 |11.348.060 |
+--+---+---+---+---+---+
| Zippy| 9.375.668 | 9.638.397 |10.044.609 |10.121.013 |10.126.495 |
+--+---+---+---+---+---+
| VT   |13.539.585 |14.018.716 |14.058.184 |14.287.382 |15.206.398 |

=
|   |   Top   |   Low   |
| Allocator |  Resident  |  Virtual   |  Resident  |   Virtual  |
+---+++++
| Standard  |  67|  97|  67|  97|
+---+++++
| Zippy | 128| 153| 128| 153|
+---+++++
| VT|  44| 137|   2|  19|
=



Machine: AMD Athlon XP2200, 1.8GHz, 512MB, Linux Suse9.1


| Allocator| 1 threads | 2 threads | 4 threads | 8 threads |16 threads |
+==+===+===+===+===+===

Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Vlad Seryakov
On Linux the Tcl version of the test just crashes constantly in free; I have
no other OSes here.


(gdb) bt
#0  0xb7f0d410 in ?? ()
#1  0xb6cd1b78 in ?? ()
#2  0x0006 in ?? ()
#3  0x3746 in ?? ()
#4  0xb7d3f731 in raise () from /lib/libc.so.6
#5  0xb7d40f08 in abort () from /lib/libc.so.6
#6  0xb7d74e7b in __libc_message () from /lib/libc.so.6
#7  0xb7d7ab10 in malloc_printerr () from /lib/libc.so.6
#8  0xb7d7c1a9 in free () from /lib/libc.so.6
#9  0x080485d9 in MemThread (arg=0x0) at ttest.c:33
#10 0xb7e8943f in NewThreadProc (clientData=0x804a358)
at 
/home/vlad/src/ossweb/external/archlinux/tcl/src/tcl8.4.14/unix/../generic/tclEvent.c:1229

#11 0xb7d014a2 in start_thread () from /lib/libpthread.so.0
#12 0xb7dd5ede in clone () from /lib/libc.so.6

Zoran Vasiljevic wrote:

On 19.12.2006, at 20:42, Stephen Deasey wrote:


On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

Right, with Ns_ functions it does not crash.


Zoran will be happy...  :-)




Not at all!

So, I would like to know exactly how to reproduce the problem
(what OS, machine, etc).
Furthermore I need all your test-code and eventually the gdb
trace of the crash, to start with.

Can you get all that for me?





--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/

#include <tcl.h>

#include <stdio.h>
#include <stdlib.h>

#define MemAlloc malloc
#define MemFree free

static int nbuffer = 16384;
static int nloops = 5;
static int nthreads = 4;

static void *gPtr = NULL;
static Tcl_Mutex gLock;

void MemThread(void *arg)
{
 int   i,n;
 void *ptr = NULL;

 for (i = 0; i < nloops; ++i) {
 n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
 if (ptr != NULL) {
 MemFree(ptr);
 }
 ptr = MemAlloc(n);
 if (n % 50 == 0) {
 Tcl_MutexLock(&gLock);
 if (gPtr != NULL) {
 MemFree(gPtr);
 gPtr = NULL;
 } else {
 gPtr = MemAlloc(n);
 }
 Tcl_MutexUnlock(&gLock);
 }
 }
}

int main (int argc, char **argv)
{
 int i;
 Tcl_ThreadId *tids;

 tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);

 for (i = 0; i < nthreads; ++i) {
 Tcl_CreateThread( &tids[i], MemThread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
 }
 for (i = 0; i < nthreads; ++i) {
 Tcl_JoinThread(tids[i], NULL);
 }
}



Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Zoran Vasiljevic


On 19.12.2006, at 20:42, Stephen Deasey wrote:


On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:


Right, with Ns_ functions it does not crash.



Zoran will be happy...  :-)




Not at all!

So, I would like to know exactly how to reproduce the problem
(what OS, machine, etc).
Furthermore I need all your test-code and eventually the gdb
trace of the crash, to start with.

Can you get all that for me?




Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Stephen Deasey

On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:


Right, with Ns_ functions it does not crash.



Zoran will be happy...  :-)



Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Vlad Seryakov

Right, with Ns_ functions it does not crash.

Stephen Deasey wrote:

On 12/19/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:

On 19.12.2006, at 17:08, Vlad Seryakov wrote:


I converted all to use pthreads directly instead of Tcl wrappers, and
now it does not crash anymore. Will continue testing but it looks like
Tcl is the problem here, not ptmalloc

Where does it crash? I see you are just using
Tcl_CreateThread
Tcl_MutexLock/Unlock
Tcl_JoinThread
Those just fall back to the underlying pthread lib.
It makes no real sense. I believe.



Simply loading the Tcl library initialises a bunch of thread stuff,
right?  Also, the Tcl mutexes are self initialising, which includes
calling down into the global Tcl mutex.  Lots of stuff going on behind
the scenes...

NaviServer mutexes are also self initialising, but they call down to
the pthread_ functions without touching any Tcl code, which may
explain why the server isn't crashing all the time.
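To illustrate what "self initialising" means here, a conceptual sketch (not
Tcl's or NaviServer's actual code): the first Lock() on a zeroed mutex handle
creates the real mutex under a global master lock, so there is hidden
allocation and locking behind what looks like a plain mutex call.

#include <pthread.h>
#include <stdlib.h>

typedef struct Mutex_ *Mutex;      /* opaque handle, starts out as NULL */

struct Mutex_ {
    pthread_mutex_t lock;
};

static pthread_mutex_t masterLock = PTHREAD_MUTEX_INITIALIZER;

static void
MutexLock(Mutex *mutexPtr)
{
    if (*mutexPtr == NULL) {
        /* First use: create the mutex under the global master lock so
         * two racing threads don't both initialise it. */
        pthread_mutex_lock(&masterLock);
        if (*mutexPtr == NULL) {
            Mutex m = malloc(sizeof(*m));      /* hidden allocation */
            pthread_mutex_init(&m->lock, NULL);
            *mutexPtr = m;
        }
        pthread_mutex_unlock(&masterLock);
    }
    pthread_mutex_lock(&(*mutexPtr)->lock);
}

static void
MutexUnlock(Mutex *mutexPtr)
{
    pthread_mutex_unlock(&(*mutexPtr)->lock);
}

Whether that hidden allocation goes through the very allocator being tested
is exactly what makes this kind of preload experiment fragile.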

So here's a test: what happens when you compile the test program to
use Ns_Mutex and Ns_ThreadCreate etc.? Pthreads work, Tcl doesn't, how
about NaviServer?




--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/

/*
 * gcc -I/usr/local/ns/include -g ttest.c -o ttest -lpthread /usr/local/ns/lib/libnsthread.so
 *
 */

#include <nsthread.h>   /* assumed header for the Ns_Thread/Ns_Mutex API */

#include <stdio.h>
#include <stdlib.h>

#define MemAlloc malloc
#define MemFree free

static int nbuffer = 16384;
static int nloops = 15;
static int nthreads = 12;

static void *gPtr = NULL;
static Ns_Mutex gLock;

void MemThread(void *arg)
{
 int   i,n;
 void *ptr = NULL;

 for (i = 0; i < nloops; ++i) {
 n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
 if (ptr != NULL) {
 MemFree(ptr);
 }
 ptr = MemAlloc(n);
 if (n % 50 == 0) {
 Ns_MutexLock(&gLock);
 if (gPtr != NULL) {
 MemFree(gPtr);
 gPtr = NULL;
 } else {
 gPtr = MemAlloc(n);
 }
 Ns_MutexUnlock(&gLock);
 }
 }
}

int main (int argc, char **argv)
{
int i;
Ns_Thread *tids;

if (argc > 1) {
nthreads = atoi(argv[1]);
}
if (argc > 2) {
nloops = atoi(argv[2]);
}
if (argc > 3) {
nbuffer = atoi(argv[3]);
}

tids = (Ns_Thread *)malloc(sizeof(Ns_Thread) * nthreads);

for (i = 0; i < nthreads; ++i) {
Ns_ThreadCreate(MemThread, 0, 0, &tids[i]);
}
for (i = 0; i < nthreads; ++i) {
Ns_ThreadJoin(&tids[i], NULL);
}
}





Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Stephen Deasey

On 12/19/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


On 19.12.2006, at 17:08, Vlad Seryakov wrote:

> I converted all to use pthreads directly instead of Tcl wrappers, and
> now it does not crash anymore. Will continue testing but it looks like
> Tcl is the problem here, not ptmalloc

Where does it crash? I see you are just using
Tcl_CreateThread
Tcl_MutexLock/Unlock
Tcl_JoinThread
Those just fall back to the underlying pthread lib.
It makes no real sense. I believe.



Simply loading the Tcl library initialises a bunch of thread stuff,
right?  Also, the Tcl mutexes are self initialising, which includes
calling down into the global Tcl mutex.  Lots of stuff going on behind
the scenes...

NaviServer mutexes are also self initialising, but they call down to
the pthread_ functions without touching any Tcl code, which may
explain why the server isn't crashing all the time.

So here's a test: what happens when you compile the test program to
use Ns_Mutex and Ns_ThreadCreate etc.? Pthreads work, Tcl doesn't, how
about NaviServer?



Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Vlad Seryakov
I have no idea; I spent too much time on this, still without realizing
what I am doing and what to expect :-)))


Zoran Vasiljevic wrote:

On 19.12.2006, at 17:08, Vlad Seryakov wrote:


I converted all to use pthreads directly instead of Tcl wrappers, and
now it does not crash anymore. Will continue testing but it looks like
Tcl is the problem here, not ptmalloc


Where does it crash? I see you are just using
Tcl_CreateThread
Tcl_MutexLock/Unlock
Tcl_JoinThread
Those just fall back to the underlying pthread lib.
It makes no real sense. I believe.








--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Zoran Vasiljevic


On 19.12.2006, at 17:08, Vlad Seryakov wrote:


I converted all to use pthreads directly instead of Tcl wrappers, and
now it does not crash anymore. Will continue testing but it looks like
Tcl is the problem here, not ptmalloc


Where does it crash? I see you are just using
Tcl_CreateThread
Tcl_MutexLock/Unlock
Tcl_JoinThread
Those just fall back to the underlying pthread lib.
It makes no real sense. I believe.







Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Vlad Seryakov
I converted all to use pthreads directly instead of Tcl wrappers, and 
now it does not crash anymore. Will continue testing but it looks like 
Tcl is the problem here, not ptmalloc


Stephen Deasey wrote:

On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

Yes, it crashes when the number of threads is more than 1, with any size, but
not all the time; sometimes I need to run it several times. It looks like
it is random, some combination, not sure of what.

I guess we never got that high concurrency in NaviServer; I wonder if
AOL has random crashes.



You're still using Tcl threads. Strip it out.
Make the loops and block size command line parameters.

If you think you've found a bug you'll want the most concise test case
so you can report it to the glibc maintainers.


#glibc on irc.freenode.net




--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Stephen Deasey

On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

Yes, it crashes when the number of threads is more than 1, with any size, but
not all the time; sometimes I need to run it several times. It looks like
it is random, some combination, not sure of what.

I guess we never got that high concurrency in NaviServer; I wonder if
AOL has random crashes.



You're still using Tcl threads. Strip it out.
Make the loops and block size command line parameters.

If you think you've found a bug you'll want the most concise test case
so you can report it to the glibc maintainers.


#glibc on irc.freenode.net



Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Zoran Vasiljevic


On 19.12.2006, at 16:35, Vlad Seryakov wrote:

Yes, it crashes when the number of threads is more than 1, with any size, but
not all the time; sometimes I need to run it several times. It looks like
it is random, some combination, not sure of what.

I guess we never got that high concurrency in NaviServer; I wonder if
AOL has random crashes.


Concurrency or not, I'm running it on the fastest Mac
you can buy, tweaked to 16 threads and with the loop increased
from 5 to 50, and I get this:

(with nedmalloc)
Blitzer:~/nedmalloc_tcl root# time ./tcltest

real0m2.036s
user0m4.652s
sys 0m1.823s

(with standard malloc)
Blitzer:~/nedmalloc_tcl root# time ./tcltest
real0m9.140s
user0m17.319s
sys 0m17.397s

So that's about 4 times faster. I cannot reproduce
any crash, whatever I try.








Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Vlad Seryakov
Yes, it crashes when the number of threads is more than 1, with any size, but
not all the time; sometimes I need to run it several times. It looks like
it is random, some combination, not sure of what.

I guess we never got that high concurrency in NaviServer; I wonder if
AOL has random crashes.


Stephen Deasey wrote:

Is this really the shortest test case you can make for this problem?

- Does it crash if you allocate blocks of size 1024 rather than random size?
  Does for me. Strip it out.

- Does it crash if you run 2 threads instead of 4?
  Does for me. Strip it out.

Sometimes it crashes, sometimes it doesn't. Clearly it's timing
related.  The root cause is not going to be identified by injecting a
whole bunch of randomness!

Make this program shorter.


On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

I tried nedmalloc with LD_PRELOAD for my little test and it crashed even
before the start.

Zoran, can you test it on Solaris and OSX so we'd know that it is not a
Linux-related problem.


#include <tcl.h>

#include <stdio.h>
#include <stdlib.h>

#define MemAlloc malloc
#define MemFree free

static int nbuffer = 16384;
static int nloops = 5;
static int nthreads = 4;

static void *gPtr = NULL;
static Tcl_Mutex gLock;

void MemThread(void *arg)
{
  int   i,n;
  void *ptr = NULL;

  for (i = 0; i < nloops; ++i) {
  n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
  if (ptr != NULL) {
  MemFree(ptr);
  }
  ptr = MemAlloc(n);
  if (n % 50 == 0) {
  Tcl_MutexLock(&gLock);
  if (gPtr != NULL) {
  MemFree(gPtr);
  gPtr = NULL;
  } else {
  gPtr = MemAlloc(n);
  }
  Tcl_MutexUnlock(&gLock);
  }
  }
}

int main (int argc, char **argv)
{
  int i;
  Tcl_ThreadId *tids;

  tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);

  for (i = 0; i < nthreads; ++i) {
  Tcl_CreateThread( &tids[i], MemThread, NULL,
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
  }
  for (i = 0; i < nthreads; ++i) {
  Tcl_JoinThread(tids[i], NULL);
  }
}




Zoran Vasiljevic wrote:

On 19.12.2006, at 01:10, Stephen Deasey wrote:


This program allocates memory in a worker thread and frees it in the
main thread. If all free()'s put memory into a thread-local cache then
you would expect this program to bloat, but it doesn't, so I guess
it's not a problem (at least not on Fedora Core 5).

It is also not the case with nedmalloc as it specifically
tracks that usage pattern. The block being free'd "knows"
to which so-called mspace it belongs regardless of which thread
frees it.

So, I'd say the nedmalloc is OK in this respect.
I have given it a purify run and it runs cleanly.
Our application is noticeably faster on Mac and
bloats less. But this is only the tip of the iceberg.
We yet have to give it a real stress-test on the
field, yet I'm reluctant to do this now and will
have to wait for a major release somewhere in spring
next year.






--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/








--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Stephen Deasey

Is this really the shortest test case you can make for this problem?

- Does it crash if you allocate blocks of size 1024 rather than random size?
 Does for me. Strip it out.

- Does it crash if you run 2 threads instead of 4?
 Does for me. Strip it out.

Sometimes it crashes, sometimes it doesn't. Clearly it's timing
related.  The root cause is not going to be identified by injecting a
whole bunch of randomness!

Make this program shorter.


On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

I tried nedmalloc with LD_PRELOAD for my little test and it crashed even
before the start.

Zoran, can you test it on Solaris and OSX so we'd know that it is not a
Linux-related problem.


#include <tcl.h>

#include <stdio.h>
#include <stdlib.h>

#define MemAlloc malloc
#define MemFree free

static int nbuffer = 16384;
static int nloops = 5;
static int nthreads = 4;

static void *gPtr = NULL;
static Tcl_Mutex gLock;

void MemThread(void *arg)
{
  int   i,n;
  void *ptr = NULL;

  for (i = 0; i < nloops; ++i) {
  n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
  if (ptr != NULL) {
  MemFree(ptr);
  }
  ptr = MemAlloc(n);
  if (n % 50 == 0) {
  Tcl_MutexLock(&gLock);
  if (gPtr != NULL) {
  MemFree(gPtr);
  gPtr = NULL;
  } else {
  gPtr = MemAlloc(n);
  }
  Tcl_MutexUnlock(&gLock);
  }
  }
}

int main (int argc, char **argv)
{
  int i;
  Tcl_ThreadId *tids;

  tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);

  for (i = 0; i < nthreads; ++i) {
  Tcl_CreateThread( &tids[i], MemThread, NULL,
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
  }
  for (i = 0; i < nthreads; ++i) {
  Tcl_JoinThread(tids[i], NULL);
  }
}




Zoran Vasiljevic wrote:
> On 19.12.2006, at 01:10, Stephen Deasey wrote:
>
>> This program allocates memory in a worker thread and frees it in the
>> main thread. If all free()'s put memory into a thread-local cache then
>> you would expect this program to bloat, but it doesn't, so I guess
>> it's not a problem (at least not on Fedora Core 5).
>
> It is also not the case with nedmalloc as it specifically
> tracks that usage pattern. The block being free'd "knows"
> to which so-called mspace it belongs regardless of which thread
> frees it.
>
> So, I'd say the nedmalloc is OK in this respect.
> I have given it a purify run and it runs cleanly.
> Our application is noticeably faster on Mac and
> bloats less. But this is only the tip of the iceberg.
> We yet have to give it a real stress-test on the
> field, yet I'm reluctant to do this now and will
> have to wait for a major release somewhere in spring
> next year.
>
>
>
>
>

--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/







Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Vlad Seryakov
I was suspecting the Linux malloc; it looks like it has problems with high
concurrency. I tried to replace MemAlloc/MemFree with mmap/munmap, and it
crashes as well.


#define MemAlloc mmalloc
#define MemFree(ptr) mfree(ptr, gSize)

void *mmalloc(size_t size) {
    return mmap(NULL, size, PROT_READ|PROT_WRITE|PROT_EXEC,
                MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
}

void mfree(void *ptr, size_t size) { munmap(ptr, size); }
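Since munmap needs the exact length that was mapped, one way to make such a
wrapper self-contained is to remember the mapping size in a small header in
front of the returned block. A minimal sketch, illustrative only and not
Vlad's code; hmalloc/hfree are invented names to avoid clashing with the
snippet above:

#include <stddef.h>
#include <sys/mman.h>

/* mmap-backed malloc/free replacement that records the mapping length. */

void *
hmalloc(size_t size)
{
    size_t  total = size + sizeof(size_t);
    void   *map = mmap(NULL, total, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (map == MAP_FAILED) {
        return NULL;
    }
    *(size_t *)map = total;              /* header: the mapped length */
    return (char *)map + sizeof(size_t); /* user memory after the header */
}

void
hfree(void *ptr)
{
    if (ptr != NULL) {
        void *map = (char *)ptr - sizeof(size_t);

        munmap(map, *(size_t *)map);     /* unmap exactly what was mapped */
    }
}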


Zoran Vasiljevic wrote:

On 19.12.2006, at 16:15, Vlad Seryakov wrote:

gdb may slow down concurrency; does it run without gdb, and also does it run
with the Solaris malloc?


No problems. Runs with malloc and nedmalloc with or w/o gdb.
The same on Mac.







--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Zoran Vasiljevic


On 19.12.2006, at 16:15, Vlad Seryakov wrote:

gdb may slow down concurrency; does it run without gdb, and also does it run
with the Solaris malloc?


No problems. Runs with malloc and nedmalloc with or w/o gdb.
The same on Mac.






Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Vlad Seryakov
gdb may slow down concurrency; does it run without gdb, and also does it run
with the Solaris malloc?


Zoran Vasiljevic wrote:

On 19.12.2006, at 16:06, Vlad Seryakov wrote:


Yes, please


( I appended the code to the nedmalloc test program
   and renamed their main to main1)

bash-2.03$ gcc -O3 -o tcltest tcltest.c -lpthread -DNDEBUG -DTCL_THREADS -I/usr/local/include -L/usr/local/lib -ltcl8.4g

bash-2.03$ gdb ./tcltest
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and  
you are
welcome to change it and/or distribute copies of it under certain  
conditions.

Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for  
details.

This GDB was configured as "sparc-sun-solaris2.8"...
(gdb) run
Starting program: /space/homes/zv/nedmalloc_tcl/tcltest
[New LWP 1]
[New LWP 2]
[New LWP 3]
[New LWP 4]
[New LWP 5]
[New LWP 6]
[New LWP 7]
[New LWP 8]
[LWP 7 exited]
[New LWP 7]
[LWP 4 exited]
[New LWP 4]
[LWP 8 exited]
[New LWP 8]

Program exited normally.
(gdb) quit






--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Zoran Vasiljevic


On 19.12.2006, at 16:06, Vlad Seryakov wrote:


Yes, please


( I appended the code to the nedmalloc test program
  and renamed their main to main1)

bash-2.03$ gcc -O3 -o tcltest tcltest.c -lpthread -DNDEBUG -DTCL_THREADS -I/usr/local/include -L/usr/local/lib -ltcl8.4g

bash-2.03$ gdb ./tcltest
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and  
you are
welcome to change it and/or distribute copies of it under certain  
conditions.

Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for  
details.

This GDB was configured as "sparc-sun-solaris2.8"...
(gdb) run
Starting program: /space/homes/zv/nedmalloc_tcl/tcltest
[New LWP 1]
[New LWP 2]
[New LWP 3]
[New LWP 4]
[New LWP 5]
[New LWP 6]
[New LWP 7]
[New LWP 8]
[LWP 7 exited]
[New LWP 7]
[LWP 4 exited]
[New LWP 4]
[LWP 8 exited]
[New LWP 8]

Program exited normally.
(gdb) quit





Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Vlad Seryakov

Yes, please

Zoran Vasiljevic wrote:

On 19.12.2006, at 15:57, Vlad Seryakov wrote:

Zoran, can you test it on Solaris and OSX so we'd know that it is not a
Linux-related problem.


I have a Tcl library compiled with nedmalloc and when I link
against it and make

#define MemAlloc Tcl_Alloc
#define MemFree Tcl_Free

it runs fine. Should I make the Solaris test?








--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Zoran Vasiljevic


On 19.12.2006, at 15:57, Vlad Seryakov wrote:



Zoran, can you test it on Solaris and OSX so we'd know that it is not a
Linux-related problem.


I have a Tcl library compiled with nedmalloc and when I link
against it and make

#define MemAlloc Tcl_Alloc
#define MemFree Tcl_Free

it runs fine. Should I make the Solaris test?







Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Vlad Seryakov
I tried nedmalloc with LD_PRELOAD for my little test and it crashed even
before the start.


Zoran, can you test it on Solaris and OSX so we'd know that it is not a
Linux-related problem.



#include <tcl.h>

#include <stdio.h>
#include <stdlib.h>

#define MemAlloc malloc
#define MemFree free

static int nbuffer = 16384;
static int nloops = 5;
static int nthreads = 4;

static void *gPtr = NULL;
static Tcl_Mutex gLock;

void MemThread(void *arg)
{
 int   i,n;
 void *ptr = NULL;

 for (i = 0; i < nloops; ++i) {
 n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
 if (ptr != NULL) {
 MemFree(ptr);
 }
 ptr = MemAlloc(n);
 if (n % 50 == 0) {
 Tcl_MutexLock(&gLock);
 if (gPtr != NULL) {
 MemFree(gPtr);
 gPtr = NULL;
 } else {
 gPtr = MemAlloc(n);
 }
 Tcl_MutexUnlock(&gLock);
 }
 }
}

int main (int argc, char **argv)
{
 int i;
 Tcl_ThreadId *tids;

 tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);

 for (i = 0; i < nthreads; ++i) {
 Tcl_CreateThread( &tids[i], MemThread, NULL,
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
 }
 for (i = 0; i < nthreads; ++i) {
 Tcl_JoinThread(tids[i], NULL);
 }
}




Zoran Vasiljevic wrote:

On 19.12.2006, at 01:10, Stephen Deasey wrote:


This program allocates memory in a worker thread and frees it in the
main thread. If all free()'s put memory into a thread-local cache then
you would expect this program to bloat, but it doesn't, so I guess
it's not a problem (at least not on Fedora Core 5).


It is also not the case with nedmalloc as it specifically
tracks that usage pattern. The block being free'd "knows"
to which so-called mspace it belongs regardless of which thread
frees it.

So, I'd say the nedmalloc is OK in this respect.
I have given it a purify run and it runs cleanly.
Our application is noticeably faster on Mac and
bloats less. But this is only the tip of the iceberg.
We yet have to give it a real stress-test on the
field, yet I'm reluctant to do this now and will
have to wait for a major release somewhere in spring
next year.







--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2006-12-19 Thread Zoran Vasiljevic


On 19.12.2006, at 01:10, Stephen Deasey wrote:


This program allocates memory in a worker thread and frees it in the
main thread. If all free()'s put memory into a thread-local cache then
you would expect this program to bloat, but it doesn't, so I guess
it's not a problem (at least not on Fedora Core 5).


It is also not the case with nedmalloc as it specifically
tracks that usage pattern. The block being free'd "knows"
to which so-called mspace it belongs regardless of which thread
frees it.
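A conceptual sketch of that ownership idea (not nedmalloc's actual data
structures; all names here are invented, and the toy "arena" is just
malloc/free underneath): each block carries a pointer to the mspace it came
from, so a free from any thread can hand it back to the right owner.

#include <stdlib.h>

/* Conceptual stand-in for a per-thread arena; here just a tag struct. */
typedef struct MSpace {
    int id;
} MSpace;

typedef struct BlockHeader {
    MSpace *owner;                      /* arena this block belongs to */
} BlockHeader;

/* Toy arena primitives: a real allocator would manage distinct pools. */
static MSpace *
CurrentThreadMSpace(void)
{
    static MSpace ms = { 0 };           /* one arena for the whole sketch */
    return &ms;
}

static void *
MSpaceAlloc(MSpace *ms, size_t size)
{
    (void)ms;
    return malloc(size);
}

static void
MSpaceRelease(MSpace *ms, void *raw)
{
    (void)ms;
    free(raw);
}

void *
OwnedAlloc(size_t size)
{
    MSpace      *ms  = CurrentThreadMSpace();
    BlockHeader *hdr = MSpaceAlloc(ms, sizeof(BlockHeader) + size);

    if (hdr == NULL) {
        return NULL;
    }
    hdr->owner = ms;                    /* record the owner in the block */
    return hdr + 1;                     /* user memory starts after header */
}

void
OwnedFree(void *ptr)
{
    if (ptr != NULL) {
        BlockHeader *hdr = (BlockHeader *)ptr - 1;

        /* Hand the block back to its owning arena, regardless of which
         * thread calls free - the cross-thread case discussed above. */
        MSpaceRelease(hdr->owner, hdr);
    }
}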

So, I'd say the nedmalloc is OK in this respect.
I have given it a purify run and it runs cleanly.
Our application is noticeably faster on Mac and
bloats less. But this is only the tip of the iceberg.
We yet have to give it a real stress-test on the
field, yet I'm reluctant to do this now and will
have to wait for a major release somewhere in spring
next year.






Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Stephen Deasey

On 12/18/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


On 18.12.2006, at 19:57, Stephen Deasey wrote:
>
>
> One thing I wonder about this is, how do requests average out across
> all threads? If you set the conn threads to exit after 10,000
> requests, will they all quit at roughly the same time causing an
> extreme load on the server?  Also, this is only an option for conn
> threads. With scheduled proc threads, job threads etc. you get
> nothing.
>

Well, if they all start to exit at the same time, they will
serialize at the point where per-thread cache is pushed to
the shared pool.


I was worried more about things like all the Tcl procs needing to be
recompiled in the new interp for the thread, and all the other stuff
which is cached.  If threads exit regularly, say after 10,000
requests, and the requests average out over all threads, then your
site will regularly go down, effectively. It would be nice if we could
make sure the thread exits were spread out.
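One simple way to spread the exits out would be to give each conn thread a
slightly jittered request limit instead of a fixed one. Just a sketch of the
idea, not an existing NaviServer option; the helper name is made up:

#include <stdlib.h>

/* Pick a per-thread request limit of `limit` +/- up to `spreadPct` percent,
 * so threads started together do not all retire at the same moment. */
static int
JitteredRequestLimit(int limit, int spreadPct, unsigned int *seedPtr)
{
    int spread = (limit * spreadPct) / 100;

    if (spread <= 0) {
        return limit;
    }
    return limit - spread + (int)(rand_r(seedPtr) % (unsigned)(2 * spread + 1));
}

With limit 10000 and 10 percent spread each thread would retire somewhere
between 9000 and 11000 requests, instead of all of them at exactly 10000.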

Anyway...


> I think some people are experiencing fragmentation problems with
> ptmalloc -- the Squid and OpenLDAP guys, for example.  There's also
> the malloc-in-one-thread, free-in-another problem, which if your
> threads don't exit is basically a leak.

Really a leak? Why? Wouldn't that depend on the implementation?


Yes, and I thought that was the case with Linux ptmalloc, but maybe I
got it wrong or this is old news...

This program allocates memory in a worker thread and frees it in the
main thread. If all free()'s put memory into a thread-local cache then
you would expect this program to bloat, but it doesn't, so I guess
it's not a problem (at least not on Fedora Core 5).


#include <tcl.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>


#define MemAlloc malloc
#define MemFree  free


void *gPtr = NULL;

static void Thread(void *arg);
static void PrintMemUsage(const char *msg);


int
main (int argc, char **argv)
{
   Tcl_ThreadId tid;
   int  i;

   PrintMemUsage("start");

   for (i = 0; i < 10; ++i) {

   Tcl_CreateThread(&tid, Thread, NULL,
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
   Tcl_JoinThread(tid, NULL);

   MemFree(gPtr);
   gPtr = NULL;
   }

   PrintMemUsage("stop");
}

static void
Thread(void *arg)
{
   assert(gPtr == NULL);
   gPtr = MemAlloc(1024);
   assert(gPtr != NULL);
}

static void
PrintMemUsage(const char *msg)
{
   FILE *f;
   int   m;

   f = fopen("/proc/self/statm", "r");
   if (f == NULL) {
   perror("fopen failed: ");
   exit(-1);
   }
   if (fscanf(f, "%d", &m) != 1) {
   perror("fscanf failed: ");
   exit(-1);
   }
   fclose(f);

   printf("%s: %d\n", msg, m);
}



Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Vlad Seryakov
I suspect I am doing something wrong, but it still crashes and I do not
see why.


#include <tcl.h>

#include <stdio.h>
#include <stdlib.h>

#define MemAlloc malloc
#define MemFree free

static int nbuffer = 16384;
static int nloops = 5;
static int nthreads = 4;

static void *gPtr = NULL;
static Tcl_Mutex gLock;

void MemThread(void *arg)
{
int   i,n;
void *ptr = NULL;

for (i = 0; i < nloops; ++i) {
n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
if (ptr != NULL) {
MemFree(ptr);
}
ptr = MemAlloc(n);
if (n % 50 == 0) {
Tcl_MutexLock(&gLock);
if (gPtr != NULL) {
MemFree(gPtr);
gPtr = NULL;
} else {
gPtr = MemAlloc(n);
}
Tcl_MutexUnlock(&gLock);
}
}
}

int main (int argc, char **argv)
{
int i;
Tcl_ThreadId *tids;

tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);

for (i = 0; i < nthreads; ++i) {
Tcl_CreateThread( &tids[i], MemThread, NULL, 
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);

}
for (i = 0; i < nthreads; ++i) {
Tcl_JoinThread(tids[i], NULL);
}
}



Stephen Deasey wrote:

On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

Still, even without the last free and with mutex around it, it core
dumps in free(gPtr) during the loop.



OK.  Still doesn't mean your program is bug free  :-)

There's a lot of extra stuff going on in your example program that
makes it hard to see what's going on. I simplified it to this:


#include <tcl.h>
#include <assert.h>
#include <stdio.h>


#define MemAlloc ckalloc
#define MemFree  ckfree


void *gPtr = NULL;  /* Global pointer to memory. */

void
Thread(void *arg)
{
assert(gPtr != NULL);

MemFree(gPtr);
gPtr = NULL;
}

int
main (int argc, char **argv)
{
Tcl_ThreadId tid;
int  i;

for (i = 0; i < 10; ++i) {

gPtr = MemAlloc(1024);
assert(gPtr != NULL);

Tcl_CreateThread(&tid, Thread, NULL,
 TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
Tcl_JoinThread(tid, NULL);

assert(gPtr == NULL);
}
}


Works for me.

I say you can allocate memory in one thread and free it in another.

Let me know what the bug turns out to be..!



Stephen Deasey wrote:

On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

I tried to run this program; it crashes with all allocators on free when
the memory was allocated in another thread. Zippy does it as well; I am not
sure how NaviServer works then.


I don't think allocate in one thread, free in another is an unusual
strategy.  Googling around I see a lot of people doing it. There must
be some bugs in your program. Here's one:

At the end of MemThread() gPtr is checked and freed, but the gMutex is
not held. This thread may have finished its tight loop, but the other
3 threads could still be running. Also, the gPtr is not set to NULL
after the free(), leading to a double free when the next thread checks
it.
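In other words, the end of MemThread() would need to look something like
this (a sketch of the fix being described, using the same names as the test
program quoted below):

    /* Free the thread-private pointer unconditionally, but touch the
     * shared gPtr only while holding the lock, and NULL it afterwards
     * so no other thread can free it a second time. */
    if (ptr != NULL) {
        MemFree(ptr);
    }
    Tcl_MutexLock(&gLock);
    if (gPtr != NULL) {
        MemFree(gPtr);
        gPtr = NULL;
    }
    Tcl_MutexUnlock(&gLock);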



#include <tcl.h>
#include <stdlib.h>

#define MemAlloc ckalloc
#define MemFree ckfree

int nbuffer = 16384;
int nloops = 5;
int nthreads = 4;

int gAllocs = 0;
void *gPtr = NULL;
Tcl_Mutex gLock;

void MemThread(void *arg)
{
 int   i,n;
 void *ptr = NULL;

 for (i = 0; i < nloops; ++i) {
 n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
 if (ptr != NULL) {
 MemFree(ptr);
 }
 ptr = MemAlloc(n);
 // Testing inter-thread alloc/free
 if (n % 5 == 0) {
 Tcl_MutexLock(&gLock);
 if (gPtr != NULL) {
 MemFree(gPtr);
 }
 gPtr = MemAlloc(n);
 gAllocs++;
 Tcl_MutexUnlock(&gLock);
 }
 }
 if (ptr != NULL) {
 MemFree(ptr);
 }
 if (gPtr != NULL) {
 MemFree(gPtr);
 }
}

void MemTime()
{
 int   i;
 Tcl_ThreadId *tids;
 tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);

 for (i = 0; i < nthreads; ++i) {
 Tcl_CreateThread( &tids[i], MemThread, NULL,
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
 }
 for (i = 0; i < nthreads; ++i) {
 Tcl_JoinThread(tids[i], NULL);
 }
}

int main (int argc, char **argv)
{
MemTime();
}



--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/



Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Zoran Vasiljevic


On 18.12.2006, at 19:57, Stephen Deasey wrote:



Are you saying you tested your app on Linux with native malloc and
experienced no fragmentation/bloating?


No. I have seen bloating but less than on zippy. I saw some
bloating and fragmentation on all optimizing allocators I
have tested.



I think some people are experiencing fragmentation problems with
ptmalloc -- the Squid and OpenLDAP guys, for example.  There's also
the malloc-in-one-thread, free-in-another problem, which if your
threads don't exit is basically a leak.


Really a leak? Why? Wouldn't that depend on the implementation?




Doesn't zippy also clear its per-thread cache on exit?


No. It shovels all the rest to the shared pool. The shared
pool is never freed. Hence lots of bloating.
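Conceptually, what zippy does at thread exit looks roughly like this (a much
simplified sketch, with invented names and locking omitted - not the actual
Tcl threadAlloc code):

typedef struct Block {
    struct Block *next;
} Block;

typedef struct Cache {
    Block *firstFree[16];               /* one free list per size bucket */
} Cache;

static Cache sharedCache;               /* grows, but never shrinks */

static void
CacheFlushAtThreadExit(Cache *threadCache)
{
    int bucket;

    for (bucket = 0; bucket < 16; bucket++) {
        Block *b = threadCache->firstFree[bucket];

        while (b != NULL) {             /* move every cached block ...  */
            Block *next = b->next;

            b->next = sharedCache.firstFree[bucket];   /* ... onto the  */
            sharedCache.firstFree[bucket] = b;         /* shared list   */
            b = next;
        }
        threadCache->firstFree[bucket] = NULL;
    }
    /* Nothing is handed back to the OS here - which is the bloat being
     * described above. */
}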



Actually, did you experiment with exiting the conn threads after X
requests? Seems to be one of the things AOL is recommending.


Most of our threads are Tcl threads, not conn threads. We create
them to do lots of different tasks. They are all rather short-lived.
Still, the mem footprint grows and grows...



One thing I wonder about this is, how do requests average out across
all threads? If you set the conn threads to exit after 10,000
requests, will they all quit at roughly the same time causing an
extreme load on the server?  Also, this is only an option for conn
threads. With scheduled proc threads, job threads etc. you get
nothing.



Well, if they all start to exit at the same time, they will
serialize at the point where per-thread cache is pushed to
the shared pool.










Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Zoran Vasiljevic


On 18.12.2006, at 22:08, Stephen Deasey wrote:



Works for me.

I say you can allocate memory in one thread and free it in another.


Nice. Well, I can say that nedmalloc works, that is, that small
program runs to the end w/o coring when compiled with nedmalloc.
Does this prove anything?







Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Stephen Deasey

On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

Still, even without the last free and with mutex around it, it core
dumps in free(gPtr) during the loop.



OK.  Still doesn't mean your program is bug free  :-)

There's a lot of extra stuff going on in your example program that
makes it hard to see what's going on. I simplified it to this:


#include <tcl.h>
#include <assert.h>
#include <stdio.h>


#define MemAlloc ckalloc
#define MemFree  ckfree


void *gPtr = NULL;  /* Global pointer to memory. */

void
Thread(void *arg)
{
   assert(gPtr != NULL);

   MemFree(gPtr);
   gPtr = NULL;
}

int
main (int argc, char **argv)
{
   Tcl_ThreadId tid;
   int  i;

   for (i = 0; i < 10; ++i) {

   gPtr = MemAlloc(1024);
   assert(gPtr != NULL);

   Tcl_CreateThread(&tid, Thread, NULL,
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
   Tcl_JoinThread(tid, NULL);

   assert(gPtr == NULL);
   }
}


Works for me.

I say you can allocate memory in one thread and free it in another.

Let me know what the bug turns out to be..!



Stephen Deasey wrote:
> On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:
>> I tried to run this program; it crashes with all allocators on free when
>> the memory was allocated in another thread. Zippy does it as well; I am
>> not sure how NaviServer works then.
>
>
> I don't think allocate in one thread, free in another is an unusual
> strategy.  Googling around I see a lot of people doing it. There must
> be some bugs in your program. Here's one:
>
> At the end of MemThread() gPtr is checked and freed, but the gMutex is
> not held. This thread may have finished its tight loop, but the other
> 3 threads could still be running. Also, the gPtr is not set to NULL
> after the free(), leading to a double free when the next thread checks
> it.
>
>
>> #include <tcl.h>
>>
>> #define MemAlloc ckalloc
>> #define MemFree ckfree
>>
>> int nbuffer = 16384;
>> int nloops = 5;
>> int nthreads = 4;
>>
>> int gAllocs = 0;
>> void *gPtr = NULL;
>> Tcl_Mutex gLock;
>>
>> void MemThread(void *arg)
>> {
>>  int   i,n;
>>  void *ptr = NULL;
>>
>>  for (i = 0; i < nloops; ++i) {
>>  n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
>>  if (ptr != NULL) {
>>  MemFree(ptr);
>>  }
>>  ptr = MemAlloc(n);
>>  // Testing inter-thread alloc/free
>>  if (n % 5 == 0) {
>>  Tcl_MutexLock(&gLock);
>>  if (gPtr != NULL) {
>>  MemFree(gPtr);
>>  }
>>  gPtr = MemAlloc(n);
>>  gAllocs++;
>>  Tcl_MutexUnlock(&gLock);
>>  }
>>  }
>>  if (ptr != NULL) {
>>  MemFree(ptr);
>>  }
>>  if (gPtr != NULL) {
>>  MemFree(gPtr);
>>  }
>> }
>>
>> void MemTime()
>> {
>>  int   i;
>>  Tcl_ThreadId *tids;
>>  tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);
>>
>>  for (i = 0; i < nthreads; ++i) {
>>  Tcl_CreateThread( &tids[i], MemThread, NULL,
>> TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
>>  }
>>  for (i = 0; i < nthreads; ++i) {
>>  Tcl_JoinThread(tids[i], NULL);
>>  }
>> }
>>
>> int main (int argc, char **argv)
>> {
>> MemTime();
>> }
>
>

--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/







Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Vlad Seryakov
Still, even without the last free and with mutex around it, it core 
dumps in free(gPtr) during the loop.


Stephen Deasey wrote:

On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

I tried to run this program; it crashes with all allocators on free when
the memory was allocated in another thread. Zippy does it as well; I am not
sure how NaviServer works then.



I don't think allocate in one thread, free in another is an unusual
strategy.  Googling around I see a lot of people doing it. There must
be some bugs in your program. Here's one:

At the end of MemThread() gPtr is checked and freed, but the gMutex is
not held. This thread may have finished its tight loop, but the other
3 threads could still be running. Also, the gPtr is not set to NULL
after the free(), leading to a double free when the next thread checks
it.



#include <tcl.h>
#include <stdlib.h>

#define MemAlloc ckalloc
#define MemFree ckfree

int nbuffer = 16384;
int nloops = 5;
int nthreads = 4;

int gAllocs = 0;
void *gPtr = NULL;
Tcl_Mutex gLock;

void MemThread(void *arg)
{
 int   i,n;
 void *ptr = NULL;

 for (i = 0; i < nloops; ++i) {
 n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
 if (ptr != NULL) {
 MemFree(ptr);
 }
 ptr = MemAlloc(n);
 // Testing inter-thread alloc/free
 if (n % 5 == 0) {
 Tcl_MutexLock(&gLock);
 if (gPtr != NULL) {
 MemFree(gPtr);
 }
 gPtr = MemAlloc(n);
 gAllocs++;
 Tcl_MutexUnlock(&gLock);
 }
 }
 if (ptr != NULL) {
 MemFree(ptr);
 }
 if (gPtr != NULL) {
 MemFree(gPtr);
 }
}

void MemTime()
{
 int   i;
 Tcl_ThreadId *tids;
 tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);

 for (i = 0; i < nthreads; ++i) {
 Tcl_CreateThread( &tids[i], MemThread, NULL,
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
 }
 for (i = 0; i < nthreads; ++i) {
 Tcl_JoinThread(tids[i], NULL);
 }
}

int main (int argc, char **argv)
{
MemTime();
}





--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Stephen Deasey

On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote:

I tried to run this program; it crashes with all allocators on free when
the memory was allocated in another thread. Zippy does it as well; I am not
sure how NaviServer works then.



I don't think allocate in one thread, free in another is an unusual
strategy.  Googling around I see a lot of people doing it. There must
be some bugs in your program. Here's one:

At the end of MemThread() gPtr is checked and freed, but the gMutex is
not held. This thread may have finished its tight loop, but the other
3 threads could still be running. Also, the gPtr is not set to NULL
after the free(), leading to a double free when the next thread checks
it.



#include <tcl.h>
#include <stdlib.h>

#define MemAlloc ckalloc
#define MemFree ckfree

int nbuffer = 16384;
int nloops = 5;
int nthreads = 4;

int gAllocs = 0;
void *gPtr = NULL;
Tcl_Mutex gLock;

void MemThread(void *arg)
{
 int   i,n;
 void *ptr = NULL;

 for (i = 0; i < nloops; ++i) {
 n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
 if (ptr != NULL) {
 MemFree(ptr);
 }
 ptr = MemAlloc(n);
 // Testing inter-thread alloc/free
 if (n % 5 == 0) {
 Tcl_MutexLock(&gLock);
 if (gPtr != NULL) {
 MemFree(gPtr);
 }
 gPtr = MemAlloc(n);
 gAllocs++;
 Tcl_MutexUnlock(&gLock);
 }
 }
 if (ptr != NULL) {
 MemFree(ptr);
 }
 if (gPtr != NULL) {
 MemFree(gPtr);
 }
}

void MemTime()
{
 int   i;
 Tcl_ThreadId *tids;
 tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);

 for (i = 0; i < nthreads; ++i) {
 Tcl_CreateThread( &tids[i], MemThread, NULL,
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
 }
 for (i = 0; i < nthreads; ++i) {
 Tcl_JoinThread(tids[i], NULL);
 }
}

int main (int argc, char **argv)
{
MemTime();
}




Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Vlad Seryakov
I tried to run this program; it crashes with all allocators on free when
the memory was allocated in another thread. Zippy does it as well; I am not
sure how NaviServer works then.



#include <tcl.h>
#include <stdlib.h>

#define MemAlloc ckalloc
#define MemFree ckfree

int nbuffer = 16384;
int nloops = 5;
int nthreads = 4;

int gAllocs = 0;
void *gPtr = NULL;
Tcl_Mutex gLock;

void MemThread(void *arg)
{
int   i,n;
void *ptr = NULL;

for (i = 0; i < nloops; ++i) {
n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
if (ptr != NULL) {
MemFree(ptr);
}
ptr = MemAlloc(n);
// Testing inter-thread alloc/free
if (n % 5 == 0) {
Tcl_MutexLock(&gLock);
if (gPtr != NULL) {
MemFree(gPtr);
}
gPtr = MemAlloc(n);
gAllocs++;
Tcl_MutexUnlock(&gLock);
}
}
if (ptr != NULL) {
MemFree(ptr);
}
if (gPtr != NULL) {
MemFree(gPtr);
}
}

void MemTime()
{
int   i;
Tcl_ThreadId *tids;
tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads);

for (i = 0; i < nthreads; ++i) {
Tcl_CreateThread( &tids[i], MemThread, NULL, 
TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);

}
for (i = 0; i < nthreads; ++i) {
Tcl_JoinThread(tids[i], NULL);
}
}

int main (int argc, char **argv)
{
   MemTime();
}



Doesn't zippy also clear its per-thread cache on exit?



It puts blocks into a shared queue which other threads can re-use.
But the shared cache never gets returned, so conn thread exits will not help
with memory bloat.



--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/




Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Stephen Deasey

On 12/18/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


On 16.12.2006, at 19:31, Vlad Seryakov wrote:

> But if speed is not important to you, you can supply Tcl without zippy;
> then there is no bloat and memory is returned to the system with
> reasonable speed. At least on Linux, ptmalloc is not that bad


OK. I think I've reached the peace of mind with all this
alternate malloc implementations...

This is what I found:

On all platforms (except Mac OSX), it really does
not pay to use anything else besides the system native
malloc. I mean, you can gain some percent of speed
with hoard/tcmalloc/nedmalloc/zippy and friends, but you
pay for this with bloated memory.



Are you saying you tested your app on Linux with native malloc and
experienced no fragmentation/bloating?

I think some people are experiencing fragmentation problems with
ptmalloc -- the Squid and OpenLDAP guys, for example.  There's also
the malloc-in-one-thread, free-in-another problem, which if your
threads don't exit is basically a leak.

If it's not a problem for your app then great!  Just wondering...



If you can afford it,
then go ahead. I believe, at least from what I've seen
from my tests, that zippy is quite fast and you gain
very little, if anything (speedwise), by replacing it.
You can gain somewhat less memory fragmentation by using
something else, but this is not a thing that would
make me say: Wow!

Exception to that is really Mac OSX. The native Mac OSX
malloc sucks tremendously. The speed increase with zippy
and nedmalloc is so high that you can really see
(without any fancy measurements) how your application
flies! The nedmalloc also bloats less than zippy (normally,
as it clears the per-thread cache on thread exit).



Doesn't zippy also clear its per-thread cache on exit?

Actually, did you experiment with exiting the conn threads after X
requests? Seems to be one of the things AOL is recommending.

One thing I wonder about this is, how do requests average out across
all threads? If you set the conn threads to exit after 10,000
requests, will they all quit at roughly the same time causing an
extreme load on the server?  Also, this is only an option for conn
threads. With scheduled proc threads, job threads etc. you get
nothing.



So for the Mac (at least for us) I will stick to nedmalloc.
It is lightning fast and reasonably conservative with
memory fragmentation.

Conclusion:

Linux/solaris = use system malloc
Mac OSX = use nedmalloc

Ah, yes... Windows... this I haven't tested, but the nedmalloc
author shows some very interesting numbers on his site.
I tend to believe them, as they resemble some I have seen
myself when experimenting on Unix platforms. So, most
probably the outcome will be:

Windows = use nedmalloc

What does this mean to all of us? I would say: very little.
We know that zippy is bloating, and now we know that it is
reasonably fast and on par with most of the other solutions
out there. For people concerned with speed, I believe it
is the right solution. For people concerned with speed AND
memory fragmentation (in that order), the best is to use some
alternative malloc routines. For people concerned with fragmentation,
the best is to stay with the system malloc; exception: Mac OSX.
There you just need to use something else, and nedmalloc is the
only thing that compiles (and works) there, to my knowledge.

I hope I could help somebody with this report.

Cheers
Zoran




Re: [naviserver-devel] Quest for malloc

2006-12-18 Thread Zoran Vasiljevic


On 16.12.2006, at 19:31, Vlad Seryakov wrote:

But if speed is not important to you, you can supply Tcl without  
zippy,

then no bloat, system is returned with reasonable speed, at least on
Linux, ptmalloc is not that bad



OK. I think I've reached peace of mind with all these
alternate malloc implementations...

This is what I found:

On all platforms (except Mac OSX), it really does
not pay to use anything else besides the system's native
malloc. I mean, you can gain a few percent of speed
with hoard/tcmalloc/nedmalloc/zippy and friends, but you
pay for it with memory bloat. If you can afford it,
then go ahead. I believe, at least from what I've seen
in my tests, that zippy is quite fast and you gain
very little, if anything (speed-wise), by replacing it.
You may get somewhat less memory fragmentation by using
something else, but that is not a thing that would
make me say: Wow!

The exception to that is really Mac OSX. The native Mac OSX
malloc sucks tremendously. The speed increases with zippy
and nedmalloc are so high that you can really see
(without any fancy measurements) how your application
flies! nedmalloc also bloats less than zippy (as expected,
since it clears the per-thread cache on thread exit).
So for the Mac (at least for us) I will stick to nedmalloc.
It is lightning fast and reasonably conservative with
memory fragmentation.

Conclusion:

   Linux/solaris = use system malloc
   Mac OSX = use nedmalloc

Ah, yes... Windows... this I haven't tested, but the nedmalloc
author shows some very interesting numbers on his site.
I tend to believe them, as they resemble some I have seen
myself when experimenting on Unix platforms. So, most
probably the outcome will be:

   Windows = use nedmalloc

What does this mean to all of us? I would say: very little.
We know that zippy is bloating, and now we know that it is
reasonably fast and on par with most of the other solutions
out there. For people concerned with speed, I believe it
is the right solution. For people concerned with speed AND
memory fragmentation (in that order), the best is to use some
alternative malloc routines. For people concerned with fragmentation,
the best is to stay with the system malloc; exception: Mac OSX.
There you just need to use something else, and nedmalloc is the
only thing that compiles (and works) there, to my knowledge.

I hope I could help somebody with this report.

Cheers
Zoran







Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Zoran Vasiljevic


On 16.12.2006, at 19:31, Vlad Seryakov wrote:


Linux, ptmalloc is not that bad


Interestingly, ptmalloc3 (http://www.malloc.de/) and
nedmalloc both derive from the dlmalloc (http://gee.cs.oswego.edu/malloc.h)
library from Doug Lea. Consequently, their performance
is similar (nedmalloc being slightly faster).
I have been able to verify this on the Linux box.






Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Zoran Vasiljevic


On 16.12.2006, at 19:31, Vlad Seryakov wrote:

But if speed is not important to you, you can supply Tcl without  
zippy,

then no bloat, system is returned with reasonable speed, at least on
Linux, ptmalloc is not that bad


Eh... Vlad...

On the Mac, nedmalloc outperforms the standard allocator
by about 25-30 times! The same goes for zippy.
All tested with the supplied test program.
I have yet to test the real app...

On other platforms (Linux, Solaris), yes, I can stay
with the standard allocator. As a matter of fact,
they are close to nedmalloc, +/- about 10-30%
(in favour of nedmalloc, except on Sun/Sparc).

One shoe does not fit all, unfortunately...

What I absolutely do not understand is: WHY?
I mean, why do I get a 30-times difference!? It just
makes no sense, but it is really true.
I am absolutely confused :-((







Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Vlad Seryakov
But if speed is not important to you, you can supply Tcl without zippy, 
then no bloat, system is returned with reasonable speed, at least on 
Linux, ptmalloc is not that bad


Zoran Vasiljevic wrote:

On 16.12.2006, at 16:25, Stephen Deasey wrote:


Something to think about: does the nedmalloc test include allocating
memory in one thread and freeing it in another?  Apparently this is
tough for some allocators, such as Linux ptmalloc. Naviserver does
this.


I'm still not 100% done reading the code, but:

The Tcl allocator just puts the free'd memory
in the cache of the current thread that calls
free(). On thread exit, or if the size of the
cache exceeds some limit, the content of the cache
is appended to the shared cache. The memory is never
returned to the system, unless it is allocated
as a chunk larger than 16K.

nedmalloc does the same, but does not move
freed memory between the per-thread cache and
the shared repository. Instead, the thread cache
is emptied (freed) when a thread exits. This
cleanup must be explicitly requested by the user.

As I see: all is green. But will pay more attention
to that by reading the code more carefully... Perhaps
there is some gotcha there which I would not like to
discover at the customer site ;-)

In nedmalloc you can disable the per-thread cache
usage by defining -DTHREADCACHEMAX=0 during compilation.
This makes some difference:

Testing nedmalloc with 5 threads ...
This allocator achieves 16194016.581962ops/sec under 5 threads

w/o cache versus

Testing nedmalloc with 5 threads ...
This allocator achieves 18895753.973492ops/sec under 5 threads

with the cache. THREADCACHEMAX defines the maximum size of
an allocation that goes into the cache, similarly to
zippy. The default is 8K (vs. 16K with zippy). The above
figures were done with a max 8K size. If you increase it
to 16K, the malloc dumps core :-( Too bad.

Still, I believe that for long running processes, the
approach of never releasing memory to the OS, as zippy
is doing, is suboptimal. Speed here or there, I'd rather
save myself process reboots if possible...
The bad thing is that the Tcl allocator (aka zippy) will not
allow me any choice but bloat. And this is becoming
more and more important. At some customer sites I have
observed process sizes of 1.5GB, whereas we started with
about 80MB. Eh!











--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/



Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Zoran Vasiljevic


On 16.12.2006, at 16:25, Stephen Deasey wrote:


Something to think about: does the nedmalloc test include allocating
memory in one thread and freeing it in another?  Apparently this is
tough for some allocators, such as Linux ptmalloc. Naviserver does
this.


I'm still not 100% done reading the code, but:

The Tcl allocator just puts the free'd memory
in the cache of the current thread that calls
free(). On thread exit, or if the size of the
cache exceeds some limit, the content of the cache
is appended to the shared cache. The memory is never
returned to the system, unless it is allocated
as a chunk larger than 16K.

nedmalloc does the same, but does not move
freed memory between the per-thread cache and
the shared repository. Instead, the thread cache
is emptied (freed) when a thread exits. This
cleanup must be explicitly requested by the user.
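
Not the actual zippy/Tcl allocator code, just a minimal sketch of the
caching scheme described above (the names and the flush threshold are made
up): frees go onto a per-thread list, and that list is spliced onto a
shared, mutex-protected pool, never back to the OS, when it grows too
large or when the thread exits.

#include <tcl.h>

typedef struct Block {
    struct Block *next;
} Block;

static Block     *sharedPool = NULL;    /* blocks any thread may reuse */
static Tcl_Mutex  sharedLock;

typedef struct Cache {                  /* one per thread (TSD in the real code) */
    Block *blocks;
    int    count;
} Cache;

#define CACHE_LIMIT 128                 /* hypothetical flush threshold */

static void
FlushCache(Cache *c)                    /* splice the local list onto the shared pool */
{
    Block *last;

    if (c->blocks == NULL) {
        return;
    }
    for (last = c->blocks; last->next != NULL; last = last->next) {
        /* walk to the tail of the local list */
    }
    Tcl_MutexLock(&sharedLock);
    last->next = sharedPool;
    sharedPool = c->blocks;
    Tcl_MutexUnlock(&sharedLock);
    c->blocks = NULL;
    c->count  = 0;
}

static void
CachedFree(Cache *c, Block *b)          /* "free": push onto the local list */
{
    b->next = c->blocks;
    c->blocks = b;
    if (++c->count > CACHE_LIMIT) {
        FlushCache(c);                  /* too big: hand the blocks to the shared pool */
    }
    /* On thread exit, FlushCache() is called once more; nothing is ever
     * returned to the system (large chunks excepted, not shown here). */
}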

As I see: all is green. But will pay more attention
to that by reading the code more carefully... Perhaps
there is some gotcha there which I would not like to
discover at the customer site ;-)

In nedmalloc you can disable the per-thread cache
usage by defining -DTHREADCACHEMAX=0 during compilation.
This makes some difference:

   Testing nedmalloc with 5 threads ...
   This allocator achieves 16194016.581962ops/sec under 5 threads

w/o cache versus

   Testing nedmalloc with 5 threads ...
   This allocator achieves 18895753.973492ops/sec under 5 threads

with the cache. THREADCACHEMAX defines the maximum size of
an allocation that goes into the cache, similarly to
zippy. The default is 8K (vs. 16K with zippy). The above
figures were done with a max 8K size. If you increase it
to 16K, the malloc dumps core :-( Too bad.

Still, I believe that for long running processes, the
approach of never releasing memory to the OS, as zippy
is doing, is suboptimal. Speed here or there, I'd rather
save myself process reboots if possible...
The bad thing is that the Tcl allocator (aka zippy) will not
allow me any choice but bloat. And this is becoming
more and more important. At some customer sites I have
observed process sizes of 1.5GB, whereas we started with
about 80MB. Eh!









Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Zoran Vasiljevic


On 16.12.2006, at 17:29, Vlad Seryakov wrote:


Instead of using threadspeed or other simple malloc/free tests, I used
NaviServer and Tcl pages as the test for allocators.
Using ab from Apache to stress-test it with thousands of requests, I tested
several allocators. And with everything the same except LD_PRELOAD,
the difference seems pretty clear. Hoard/tcmalloc/ptmalloc2 are all
slower than zippy, no doubt. With threadtest, though, tcmalloc was
faster than zippy, but in real life it behaves differently.

So, I would suggest you try hitting NaviServer with nedmalloc. If it
is always faster than zippy, then you got what you want. Another
thing to watch: after each test, check the size of the nsd process.

I will try nedmalloc as well later today


Indeed, the best way is to check out the real application.
No test program can give you a better picture!

As far as this is concerned, I do plan to run this test,
but it takes some time! I spent the whole day getting
nedmalloc to compile OK on all platforms that we use
(Solaris sparc/x86, Mac ppc/x86, Linux/x86, Win). The next
step is to snap it into the Tcl library and try the real
application...






Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Vlad Seryakov
You can; it moves Tcl_Obj structs between the thread and shared pools, and
the same goes for other memory blocks. On thread exit
all memory goes to the shared pool.

Zoran Vasiljevic wrote:

On 16.12.2006, at 17:15, Stephen Deasey wrote:

  

Yeah, pretty sure.  You can only use Tcl objects within a single
interp, which is restricted to a single thread, but general
ns_malloc'd memory chunks can be passed around between threads. It
would suck pretty hard if that wasn't the case.



Interesting... I could swear I read that you
can't just alloc in one thread and free in another
using the Tcl allocator.

Well, regarding the nedmalloc, I do not know, but
I can find out...





  





Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Vlad Seryakov
Instead of using threadspeed or other simple malloc/free tests, I used
NaviServer and Tcl pages as the test for allocators.
Using ab from Apache to stress-test it with thousands of requests, I tested
several allocators. And with everything the same except LD_PRELOAD,
the difference seems pretty clear. Hoard/tcmalloc/ptmalloc2 are all
slower than zippy, no doubt. With threadtest, though, tcmalloc was
faster than zippy, but in real life it behaves differently.

So, I would suggest you try hitting NaviServer with nedmalloc. If it
is always faster than zippy, then you got what you want. Another
thing to watch: after each test, check the size of the nsd process.

I will try nedmalloc as well later today


Stephen Deasey wrote:

On 12/16/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:
  

Are you sure? AFAIK, we just go down to Tcl_Alloc in the Tcl library.
The allocator there will not allow you that. There were some discussions
on comp.lang.tcl about it (Jeff Hobbs knows better). As they (Tcl)
just "inherited" what AOLserver had at that time (I believe V4.0),
the same that applies to AS applies to Tcl, and indirectly to us.





Yeah, pretty sure.  You can only use Tcl objects within a single
interp, which is restricted to a single thread, but general
ns_malloc'd memory chunks can be passed around between threads. It
would suck pretty hard if that wasn't the case.

We have a bunch of reference counted stuff, cache values for example,
which we share among threads and delete when the reference count drops
to zero.  You can ns_register_proc from any thread, which needs to
ns_free the old value...

Here's the (a?) problem:

http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html


  





Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Zoran Vasiljevic


On 16.12.2006, at 17:15, Stephen Deasey wrote:



Yeah, pretty sure.  You can only use Tcl objects within a single
interp, which is restricted to a single thread, but general
ns_malloc'd memory chunks can be passed around between threads. It
would suck pretty hard if that wasn't the case.


Interesting... I could swear I read that you
can't just alloc in one thread and free in another
using the Tcl allocator.

Well, regarding the nedmalloc, I do not know, but
I can find out...






Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Stephen Deasey

On 12/16/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


Are you sure? AFAIK, we just go down to Tcl_Alloc in the Tcl library.
The allocator there will not allow you that. There were some discussions
on comp.lang.tcl about it (Jeff Hobbs knows better). As they (Tcl)
just "inherited" what AOLserver had at that time (I believe V4.0),
the same that applies to AS applies to Tcl, and indirectly to us.




Yeah, pretty sure.  You can only use Tcl objects within a single
interp, which is restricted to a single thread, but general
ns_malloc'd memory chunks can be passed around between threads. It
would suck pretty hard if that wasn't the case.

We have a bunch of reference counted stuff, cache values for example,
which we share among threads and delete when the reference count drops
to zero.  You can ns_register_proc from any thread, which needs to
ns_free the old value...
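
A minimal sketch of that pattern, under assumptions: the Value type, the
helper names and valueLock are invented for illustration, and plain
Tcl_Alloc/Tcl_Free stand in for the server's own wrappers. The point is
that whichever thread drops the count to zero performs the free, so memory
allocated in one thread is routinely freed in another.

#include <tcl.h>
#include <string.h>

typedef struct Value {
    int  refcnt;
    char data[1];                 /* variable-length payload */
} Value;

static Tcl_Mutex valueLock;

static Value *
ValueNew(const char *s)           /* allocated in whatever thread creates it */
{
    Value *v = (Value *) Tcl_Alloc(sizeof(Value) + strlen(s));

    v->refcnt = 1;
    strcpy(v->data, s);
    return v;
}

static void
ValueIncr(Value *v)
{
    Tcl_MutexLock(&valueLock);
    v->refcnt++;
    Tcl_MutexUnlock(&valueLock);
}

static void
ValueDecr(Value *v)               /* may run in a different thread than ValueNew */
{
    int last;

    Tcl_MutexLock(&valueLock);
    last = (--v->refcnt == 0);
    Tcl_MutexUnlock(&valueLock);
    if (last) {
        Tcl_Free((char *) v);     /* the cross-thread free happens here */
    }
}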

Here's the (a?) problem:

http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html



Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Zoran Vasiljevic


On 15.12.2006, at 19:59, Vlad Seryakov wrote:


 Will try this one.


To aid you (and others):

http://www.archiware.com/downloads/nedmalloc_tcl.tar.gz

Download it and peek at the README file. This compiles on all
machines I tested and works pretty well in terms of speed.
I haven't tested the memory size, nor do I have any idea about
fragmentation, but the speed is pretty good.

Just look what this does on the Mac Pro (http://www.apple.com/macpro)
which is currently the fastest Mac available:


   Testing standard allocator with 5 threads ...
   This allocator achieves 531241.923013ops/sec under 5 threads

   Testing Tcl allocator with 5 threads ...
   This allocator achieves 439181.119284ops/sec under 5 threads

   Testing nedmalloc with 5 threads ...
   This allocator achieves 4137423.021490ops/sec under 5 threads


nedmalloc allocator is 7.788209 times faster than standard


Tcl allocator is 0.826706 times faster than standard


nedmalloc is 9.420767 times faster than Tcl allocator

Hm... if I was not able to get same/similar results
on other Mac's, I'd say this is a cheat. But it isn't.

Zoran






Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Zoran Vasiljevic


On 16.12.2006, at 16:25, Stephen Deasey wrote:


They seem, in the end, to go for Google tcmalloc. It wasn't the
absolute fastest for their particular set of tests, but had
dramatically lower memory usage.


The downside of tcmalloc: it only has a Linux port.

nedmalloc does them all (Win, Solaris, Linux, Mac OSX), as
it is written in ANSI C and designed to be portable.
I tested all our Unix boxes and was able to get it running
on all of them. And the integration is rather simple; just
add:

 #include "nedmalloc.h"   /* header name assumed from the nedmalloc package */
 #define malloc  nedmalloc
 #define realloc nedrealloc
 #define free    nedfree

I believe this needs to be done in just one Tcl source file.
Trickier part: you need to call neddisablethreadcache(0)
at every thread exit.
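
One way to arrange that, sketched under assumptions: the header name and
the NedCache* function names are mine, and neddisablethreadcache(0) acts on
nedmalloc's default pool as described on its site. A per-thread Tcl exit
handler keeps the call out of the individual thread functions:

#include <tcl.h>
#include "nedmalloc.h"            /* assumed header name from the nedmalloc package */

static void
NedCacheCleanup(ClientData clientData)
{
    neddisablethreadcache(0);     /* flush/disable this thread's cache on exit */
}

void
NedCacheInitForThread(void)       /* call once in every thread that allocates */
{
    Tcl_CreateThreadExitHandler(NedCacheCleanup, NULL);
}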

The lower memory usage is important of course. Here I have
no experience yet.



Something to think about: does the nedmalloc test include allocating
memory in one thread and freeing it in another?  Apparently this is
tough for some allocators, such as Linux ptmalloc. Naviserver does
this.


Are you sure? AFAIK, we just go down to Tcl_Alloc in the Tcl library.
The allocator there will not allow you that. There were some discussions
on comp.lang.tcl about it (Jeff Hobbs knows better). As they (Tcl)
just "inherited" what AOLserver had at that time (I believe V4.0),
the same that applies to AS applies to Tcl, and indirectly to us.








Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Stephen Deasey

On 12/16/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


Hey! I think our customers will love it! I will now try to
ditch the zippy and replace it with nedmalloc... Too bad that
Tcl as-is does not allow easy snap-in of alternate memory allocators.
I think this should be lobbied for.



It would be nice to at least have a configure switch for the zippy
allocator rather than having to hack up the Makefile.



Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Stephen Deasey

On 12/16/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote:


On 15.12.2006, at 19:59, Vlad Seryakov wrote:

>>
>> http://www.nedprod.com/programs/portable/nedmalloc/index.html


Hm... not bad at all:
This was under Solaris 2.8 on a Sun Blade2500 (Sparc) 1GB memory:

 Testing standard allocator with 8 threads ...
 This allocator achieves 2098770.683107ops/sec under 8 threads

 Testing nedmalloc with 8 threads ...
 This allocator achieves 1974570.587561ops/sec under 8 threads

 Testing Tcl alloc  with 8 threads ...
 This allocator achieves 1449969.176647ops/sec under 8 threads

Now on a SuSE Linux, a 1.8GHz Intel:

 Testing standard allocator with 8 threads ...
 This allocator achieves 1752893.072620ops/sec under 8 threads

 Testing nedmalloc with 8 threads ...
 This allocator achieves 2114564.246869ops/sec under 8 threads

 Testing Tcl alloc  with 8 threads ...
 This allocator achieves 1460851.824732ops/sec under 8 threads


The Tcl library was compiled for threads and uses the zippy
allocator. This is how I compiled the test program from the
nedmalloc package:

gcc -O -g -o test test.c -lpthread -DNDEBUG -DTCL_THREADS -I/usr/
local/include -L/usr/local/lib -ltcl8.4g

I had to make some tweaks, as they have a problem in a private
pthread_islocked() call. Also, I expanded the testsuite to include
Tcl_Alloc/Tcl_Free.

If I run this same thing on other platforms I get more/less same
results with one notable exception:

   o. nedmalloc is always faster than standard or zippy, except on
      Sun Sparc, where the built-in malloc is the fastest

   o. the zippy (Tcl) allocator is always the slowest among the three

Now, I imagine, the nedmalloc test program may not be telling all the
truth
(i.e. may be biased towards nedmalloc)...

It would be interesting to see some other metrics...



Some other metrics:

 http://archive.netbsd.se/?ml=OpenLDAP-devel&a=2006-07&t=2172728

They seem, in the end, to go for Google tcmalloc. It wasn't the
absolute fastest for their particular set of tests, but had
dramatically lower memory usage.

Something to think about: does the nedmalloc test include allocating
memory in one thread and freeing it in another?  Apparently this is
tough for some allocators, such as Linux ptmalloc. Naviserver does
this.



Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Zoran Vasiljevic


On 16.12.2006, at 15:00, Zoran Vasiljevic wrote:



On 15.12.2006, at 19:59, Vlad Seryakov wrote:



http://www.nedprod.com/programs/portable/nedmalloc/index.html



Hm... not bad at all:


This was on a iMac with Intel Dual Core 1.83 Ghz and 512 MB memory

 Testing standard allocator with 8 threads ...
 This allocator achieves 319503.459835ops/sec under 8 threads

 Testing nedmalloc with 8 threads ...
 This allocator achieves 1687884.294403ops/sec under 8 threads

 Testing Tcl alloc  with 8 threads ...
 This allocator achieves 294571.750823ops/sec under 8 threads


Hey! I think our customers will love it! I will now try to
ditch the zippy and replace it with nedmalloc... Too bad that
Tcl as-is does not allow easy snap-in of alternate memory allocators.
I think this should be lobbied for.




This was under Solaris 2.8 on a Sun Blade2500 (Sparc) 1GB memory:

 Testing standard allocator with 8 threads ...
 This allocator achieves 2098770.683107ops/sec under 8 threads

 Testing nedmalloc with 8 threads ...
 This allocator achieves 1974570.587561ops/sec under 8 threads

 Testing Tcl alloc  with 8 threads ...
 This allocator achieves 1449969.176647ops/sec under 8 threads

Now on a SuSE Linux, a 1.8GHz Intel:

 Testing standard allocator with 8 threads ...
 This allocator achieves 1752893.072620ops/sec under 8 threads

 Testing nedmalloc with 8 threads ...
 This allocator achieves 2114564.246869ops/sec under 8 threads

 Testing Tcl alloc  with 8 threads ...
 This allocator achieves 1460851.824732ops/sec under 8 threads


The Tcl library was compiled for threads and uses the zippy
allocator. This is how I compiled the test program from the
nedmalloc package:

gcc -O -g -o test test.c -lpthread -DNDEBUG -DTCL_THREADS -I/usr/
local/include -L/usr/local/lib -ltcl8.4g

I had to make some tweaks, as they have a problem in a private
pthread_islocked() call. Also, I expanded the testsuite to include
Tcl_Alloc/Tcl_Free.

If I run this same thing on other platforms I get more/less same
results with one notable exception:

   o. nedmalloc is always faster than standard or zippy, except on
      Sun Sparc, where the built-in malloc is the fastest

   o. the zippy (Tcl) allocator is always the slowest among the three

Now, I imagine, the nedmalloc test program may not be telling all the
truth
(i.e. may be biased towards nedmalloc)...

It would be interesting to see some other metrics...

Cheers
Zoran









Re: [naviserver-devel] Quest for malloc

2006-12-16 Thread Zoran Vasiljevic


On 15.12.2006, at 19:59, Vlad Seryakov wrote:



http://www.nedprod.com/programs/portable/nedmalloc/index.html



Hm... not bad at all:
This was under Solaris 2.8 on a Sun Blade2500 (Sparc) 1GB memory:

Testing standard allocator with 8 threads ...
This allocator achieves 2098770.683107ops/sec under 8 threads

Testing nedmalloc with 8 threads ...
This allocator achieves 1974570.587561ops/sec under 8 threads

Testing Tcl alloc  with 8 threads ...
This allocator achieves 1449969.176647ops/sec under 8 threads

Now on a SuSE Linux, a 1.8GHz Intel:

Testing standard allocator with 8 threads ...
This allocator achieves 1752893.072620ops/sec under 8 threads

Testing nedmalloc with 8 threads ...
This allocator achieves 2114564.246869ops/sec under 8 threads

Testing Tcl alloc  with 8 threads ...
This allocator achieves 1460851.824732ops/sec under 8 threads


The Tcl library was compiled for threads and uses the zippy
allocator. This is how I compiled the test program from the
nedmalloc package:

gcc -O -g -o test test.c -lpthread -DNDEBUG -DTCL_THREADS -I/usr/ 
local/include -L/usr/local/lib -ltcl8.4g


I had to make some tweaks, as they have a problem in a private
pthread_islocked() call. Also, I expanded the testsuite to include
Tcl_Alloc/Tcl_Free.
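
A sketch of roughly what the added Tcl_Alloc/Tcl_Free case might look like
(the exact harness in the nedmalloc test program differs; NLOOPS and
MAXSIZE are made-up values): each thread runs random-sized alloc/free
pairs, and the harness divides the loop count by the elapsed time to get
the ops/sec figures quoted below.

#include <tcl.h>
#include <stdlib.h>

#define NLOOPS   1000000
#define MAXSIZE  8192                /* hypothetical upper bound on request size */

static void
TclAllocWorker(void *arg)            /* body run by each benchmark thread */
{
    int i;

    (void) arg;
    for (i = 0; i < NLOOPS; i++) {
        unsigned size = 1 + (unsigned) (rand() % MAXSIZE);
        char *p = Tcl_Alloc(size);

        p[0] = 0;                    /* touch the block so it is actually used */
        Tcl_Free(p);
    }
}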

If I run this same thing on other platforms I get more/less same
results with one notable exception:

  o. nedmalloc is always faster than standard or zippy, except on
     Sun Sparc, where the built-in malloc is the fastest

  o. the zippy (Tcl) allocator is always the slowest among the three

Now, I imagine, the nedmalloc test program may not be telling all the
truth (i.e. may be biased towards nedmalloc)...

It would be interesting to see some other metrics...

Cheers
Zoran
 





Re: [naviserver-devel] Quest for malloc

2006-12-15 Thread Zoran Vasiljevic


On 15.12.2006, at 19:59, Vlad Seryakov wrote:

I also tried Hoard, Google tcmalloc, umem and some other rare
mallocs I could find. Still, zippy beats everybody; I ran my speed
test, not threadtest. Will try this one.


Important: it is not only raw speed that matters, but also
memory fragmentation (i.e. the lack of it).
In our app we must frequently reboot the server (every couple of
days), otherwise it just bloats. And... we made sure there are
no leaks (we have Purified all the libs that we use)...

I now have some experience with the (zippy) fragmentation and I will
try to make a testbed with this allocator and run it for several
days to get some experience.

Cheers
Zoran




Re: [naviserver-devel] Quest for malloc

2006-12-15 Thread Vlad Seryakov
I also tried Hoard, Google tcmalloc, umem and some other rare mallocs I
could find. Still, zippy beats everybody; I ran my speed test, not
threadtest. Will try this one.


Zoran Vasiljevic wrote:

Hi!

I've tried libumem as Stephen suggested, but it is slower
than the regular system malloc. This (libumem) is really
geared toward the integration with the mdb (solaris modular
debugger) for memory debugging and analysis.

But, I've found:

http://www.nedprod.com/programs/portable/nedmalloc/index.html

and this looks more promising. I have run its (supplied)
test and it seems that, at least speedwise, the code is
faster than the native OS malloc. I will now try to make it work
on all platforms that we use (admittedly, it will not run
correctly if you do not set -DNDEBUG to silence some assertions;
this is of course not right and I have to see why/what).

Anyways perhaps a thing to try out...

If you get any breath-taking news with the above, share it here.
On my PPC PowerBook (1.5GHz PPC, 512 MB memory) I get improvements
over the built-in allocator of a factor of 3 (3 times better)
with far less system overhead. I cannot say anything about the
fragmentation; this has yet to be tested.

Cheers
Zoran






--
Vlad Seryakov
571 262-8608 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/