Re: [naviserver-devel] Quest for malloc
Hi Jeff, we are aware that the function is essentially an integer log2. The chosen C-based variant is actually faster and more general than what you have included (it needs at most 2 shift operations for the relevant range), but the assembler-based variant is hard to beat and yields another 3% of benchmark performance on top of the fastest C version. Thanks for that! -gustaf Jeff Rogers schrieb: I don't think anyone has pointed this out yet, but this is a logarithm in base 2 (log2), and there are a fair number of implementations of this available; for maximum performance there are assembly implementations using 'bsr' on x86 architectures, such as this one from google's tcmalloc:
Re: [naviserver-devel] Quest for malloc
Gustaf Neumann wrote: This is most probably the best variant so far, and not complicated, so an optimizer can do "the right thing" easily. sorry for the many versions.. -gustaf

{
    unsigned register int s = (size-1) >> 3;
    while (s > 1) { s >>= 1; bucket++; }
}
if (bucket > NBUCKETS) { bucket = NBUCKETS; }

I don't think anyone has pointed this out yet, but this is a logarithm in base 2 (log2), and there are a fair number of implementations of this available; for maximum performance there are assembly implementations using 'bsr' on x86 architectures, such as this one from google's tcmalloc:

// Return floor(log2(n)) for n > 0.
#if (defined __i386__ || defined __x86_64__) && defined __GNUC__
static inline int LgFloor(size_t n) {
  // "ro" for the input spec means the input can come from either a
  // register ("r") or offsetable memory ("o").
  size_t result;
  __asm__("bsr %1, %0"
          : "=r" (result)   // Output spec
          : "ro" (n)        // Input spec
          : "cc"            // Clobbers condition-codes
         );
  return result;
}
#else
// Note: the following only works for "n"s that fit in 32-bits, but
// that is fine since we only use it for small sizes.
static inline int LgFloor(size_t n) {
  int log = 0;
  for (int i = 4; i >= 0; --i) {
    int shift = (1 << i);
    size_t x = n >> shift;
    if (x != 0) {
      n = x;
      log += shift;
    }
  }
  ASSERT(n == 1);
  return log;
}
#endif

(Disclaimer - this comment is based on my explorations of zippy, not vt, so the logic may be entirely different.) If this log2(requested_size) is used to index directly into the bucket table, that necessarily restricts you to power-of-2 bucket sizes, meaning you allocate on average nearly 50% more than requested (i.e., nearly 33% of allocated memory is overhead/wasted). Adding more, closer-spaced buckets adds to the base footprint but possibly reduces the peak usage by dropping the wasted space. I believe tcmalloc uses buckets spaced so that the average waste is only 12.5%. -J
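[Editor's note] For readers without x86 assembler handy, roughly the same floor(log2) can be computed portably. The sketch below is only an illustration: it assumes GCC/Clang's __builtin_clzl builtin and a bucket-0 block size of 16 bytes, and the names LgFloorPortable/SizeToBucket are mine, not taken from the zippy or vtmalloc sources. It maps a request size to a power-of-2 bucket index in a handful of instructions, matching the "(size-1) >> 3" shift loop quoted above.

#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Portable floor(log2(n)) for n > 0, using the count-leading-zeros
 * builtin instead of inline bsr assembly. */
static inline int LgFloorPortable(size_t n)
{
    assert(n > 0);
    return (int)(sizeof(unsigned long) * CHAR_BIT - 1)
           - __builtin_clzl((unsigned long)n);
}

/* Map a request size (>= 1) to a power-of-2 bucket index,
 * bucket 0 = 16-byte blocks.  Equivalent to:
 *   s = (size-1) >> 3; while (s > 1) { s >>= 1; bucket++; } */
static inline int SizeToBucket(size_t size)
{
    size_t s = (size - 1) >> 3;
    return (s > 1) ? LgFloorPortable(s) : 0;
}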
Re: [naviserver-devel] Quest for malloc
Am 16.01.2007 um 15:52 schrieb Zoran Vasiljevic: You see, even we (i.e. Mike) noticed one glitch in the test program that made Zippy look ridiculous on the Mac, although it wasn't.

Hmhmhmh... I must have done something very wrong :-( When I now repeat the tests on Mac/Zippy, even with the size limited to 16000 bytes, it still performs miserably. For just one thread, it gives "decent" values (although still 2.5 times slower than VT). For two threads, it goes down to about 1/5th, and so on... I have asked Gustaf to try to reproduce that on his Mac, as I am slowly starting to see white mice (no, I never drink _any_ alcohol)... If Gustaf confirms my findings, then we are still back where we were with Zippy. And yes, I have disabled that block splitting Mike was talking about in his email, so it is not that. And... it is not the size of the allocation (> 16284), as I fixed that as well...

Background: I wanted to update the README file with new performance values and found out that Zippy's behaviour hasn't changed, although I thought it was fixed with that size change... Hmmm... Zoran
Re: [naviserver-devel] Quest for malloc
Yes, it is a combined version, but the Tcl version is slightly different and Zoran took it over to maintain. In my tarball I include both; we do experiments in different directions and then combine the best results. Also, the intention was to try to include it in Tcl itself. Stephen Deasey wrote: On 1/16/07, Stephen Deasey <[EMAIL PROTECTED]> wrote: On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: Am 16.01.2007 um 12:18 schrieb Stephen Deasey: vtmalloc <-- add this It's there. Everybody can now contribute, if needed. Rocking. I suggest putting the 0.0.3 tarball up on sourceforge, announcing on Freshmeat, and cross-posting on the aolserver list. You really want random people with their random workloads on random OS to beat on this. I don't know if the pool of people here is large enough for that... I'm sure there's a lot of other people who would be interested in this, if they knew about it. Should probably cross-post here, for example: http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory? Vlad's already on the ball... http://freshmeat.net/projects/vtmalloc/ -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/
Re: [naviserver-devel] Quest for malloc
Am 16.01.2007 um 15:41 schrieb Stephen Deasey: I suggest putting the 0.0.3 tarball up on sourceforge, announcing on Freshmeat, and cross-posting on the aolserver list. You really want random people with their random workloads on random OS to beat on this. I don't know if the pool of people here is large enough for that... I'm sure there's a lot of other people who would be interested in this, if they knew about it. Should probably cross-post here, for example: http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?

The plan was to beat this beast first in the "family", then go to the next village (aol-list) and then visit the next town (tcl-core list), in that sequence. You see, even we (i.e. Mike) noticed one glitch in the test program that made Zippy look ridiculous on the Mac, although it wasn't. So we now have enough experience to go visit our neighbours and see what they'll say. If the feedback is positive, the next stop is the Tcl core list. There I expect the fiercest opposition to any change (which is understandable, given the size of the group of people involved and the kind of change). Cheers Zoran
Re: [naviserver-devel] Quest for malloc
On 1/16/07, Stephen Deasey <[EMAIL PROTECTED]> wrote: On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: > > Am 16.01.2007 um 12:18 schrieb Stephen Deasey: > > > vtmalloc <-- add this > > It's there. Everybody can now contribute, if needed. > Rocking. I suggest putting the 0.0.3 tarball up on sourceforge, announcing on Freshmeat, and cross-posting on the aolserver list. You really want random people with their random workloads on random OS to beat on this. I don't know if the pool of people here is large enough for that... I'm sure there's a lot of other people who would be interested in this, if they knew about it. Should probably cross-post here, for example: http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory? Vlad's already on the ball... http://freshmeat.net/projects/vtmalloc/
Re: [naviserver-devel] Quest for malloc
On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: Am 16.01.2007 um 12:18 schrieb Stephen Deasey: > vtmalloc <-- add this It's there. Everybody can now contribute, if needed. Rocking. I suggest putting the 0.0.3 tarball up on sourceforge, announcing on Freshmeat, and cross-posting on the aolserver list. You really want random people with their random workloads on random OS to beat on this. I don't know if the pool of people here is large enough for that... I'm sure there's a lot of other people who would be interested in this, if they knew about it. Should probably cross-post here, for example: http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?
Re: [naviserver-devel] Quest for malloc
Am 16.01.2007 um 12:18 schrieb Stephen Deasey: vtmalloc <-- add this It's there. Everybody can now contribute, if needed.
Re: [naviserver-devel] Quest for malloc
On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: Am 16.01.2007 um 10:37 schrieb Stephen Deasey: > > Can you import this into CVS? Top level. > You mean the tclThreadAlloc.c file on top-level of the naviserver project? The whole thing: README, licence, tests etc. By top level, I just mean not in the modules directory, because it isn't one. So, CVS: naviserver modules website vtmalloc <-- add this Unless you're planning to push this upstream in the next week or so. Or you really want to host this on your own website. It's a shame to have good work hidden in random places.
Re: [naviserver-devel] Quest for malloc
Zoran Vasiljevic schrieb: Guess what: it is _slower_ now than the

    s = (size-1) >> 3; while (s>1) {s >>= 1; bucket++;}

I tend to like that one as it is really neat. It will also better illustrate what is being done.

this is the last one for today. It is the unrolled variant, with fewer tests, and still human readable. It should be faster than the unrolled while variants -gustaf

{
    unsigned register int s = (size-1) >> 4;
    while (s >= 0x1000) { s >>= 12; bucket += 12; }
    if      (s >= 0x0800) { s >>= 11; bucket += 11; }
    else if (s >= 0x0400) { s >>= 10; bucket += 10; }
    else if (s >= 0x0200) { s >>=  9; bucket +=  9; }
    else if (s >= 0x0100) { s >>=  8; bucket +=  8; }
    else if (s >= 0x0080) { s >>=  7; bucket +=  7; }
    else if (s >= 0x0040) { s >>=  6; bucket +=  6; }
    else if (s >= 0x0020) { s >>=  5; bucket +=  5; }
    else if (s >= 0x0010) { s >>=  4; bucket +=  4; }
    else if (s >= 0x0008) { s >>=  3; bucket +=  3; }
    else if (s >= 0x0004) { s >>=  2; bucket +=  2; }
    else if (s >= 0x0002) { s >>=  1; bucket +=  1; }
    if (s >= 1) { bucket++; }
    if (bucket > NBUCKETS) { bucket = NBUCKETS; }
}
Re: [naviserver-devel] Quest for malloc
Am 16.01.2007 um 10:37 schrieb Stephen Deasey: Can you import this into CVS? Top level. You mean the tclThreadAlloc.c file on top-level of the naviserver project?
Re: [naviserver-devel] Quest for malloc
Am 16.01.2007 um 11:24 schrieb Gustaf Neumann: if all cases are used, all but the first loop are executed at most once and could be changed into ifs... I will send you such a variant in a separate mail, but I am currently running out of battery.

Guess what: it is _slower_ now than the

    s = (size-1) >> 3; while (s>1) {s >>= 1; bucket++;}

I tend to like that one as it is really neat. It will also better illustrate what is being done. Watch: _slower_ means about 1-2%, so I do not believe we need to improve on that any more. The above version is, I believe, the most "opportune", as it is readable (thus understandable) and very fast.
Re: [naviserver-devel] Quest for malloc
Zoran Vasiljevic schrieb: Am 16.01.2007 um 10:46 schrieb Gustaf Neumann: This is most probably the best variant so far, and not complicated, so an optimizer can do "the right thing" easily. sorry for the many versions.. -gustaf

{
    unsigned register int s = (size-1) >> 3;
    while (s > 1) { s >>= 1; bucket++; }
}
if (bucket > NBUCKETS) { bucket = NBUCKETS; }

You'd be surprised that this one

i am. that's the story of the unrolled loops. Btw, the version you have listed as the fastest has wrong boundary tests (but still gives the same result). Below is the corrected version, which needs at most 2 shift operations for sizes up to one million. The nice thing about this code (due to the staggered whiles) is that any of the while loops (except the last) can be removed and the code still works correctly (it just needs more shift operations). That's the reason why yesterday's version actually works. If all cases are used, all but the first loop are executed at most once and could be changed into ifs... I will send you such a variant in a separate mail, but I am currently running out of battery.

while (s >= 0x1000) { s >>= 12; bucket += 12; }
while (s >= 0x0800) { s >>= 11; bucket += 11; }
while (s >= 0x0400) { s >>= 10; bucket += 10; }
while (s >= 0x0200) { s >>=  9; bucket +=  9; }
while (s >= 0x0100) { s >>=  8; bucket +=  8; }
while (s >= 0x0080) { s >>=  7; bucket +=  7; }
while (s >= 0x0040) { s >>=  6; bucket +=  6; }
while (s >= 0x0020) { s >>=  5; bucket +=  5; }
while (s >= 0x0010) { s >>=  4; bucket +=  4; }
while (s >= 0x0008) { s >>=  3; bucket +=  3; }
while (s >= 0x0004) { s >>=  2; bucket +=  2; }
while (s >= 1)      { s >>=  1; bucket++; }
if (bucket > NBUCKETS) { bucket = NBUCKETS; }

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 10098495 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

whereas this one: s = (size-1) >> 3; while (s>1) { s >>= 1; bucket++; } gives:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 9720847 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

That is ((10098495-9720847)/10098495)*100 ≈ 3.7% less. That is all measured on Linux. I haven't done it on the Mac and on the Sun yet. I now have all versions inside and will play a little on each platform to see which one operates best overall. The latest one is more appealing because of the simplicity of the code, so we can turn a blind eye to those few percent, I guess. Cheers Zoran
Re: [naviserver-devel] Quest for malloc
Am 16.01.2007 um 10:46 schrieb Gustaf Neumann:

s = (size-1) >> 3;
while (s>1) { s >>= 1; bucket++; }

On Linux and Solaris (both x86 machines) the "long" version:

s = (size-1) >> 4;
while (s > 0xFF) { s = s >> 5; bucket += 5; }
while (s > 0x0F) { s = s >> 4; bucket += 4; }
...

is faster than the "short" one above. On Mac OSX it is the same (no difference). Look at the Sun Solaris 10 (x86 box):

(the "short" version)
Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 13753084 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

(the "long" version)
-bash-3.00$ ./memtest
Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 14341236 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

That is ((14341236-13753084)/14341236)*100 = 4%. On Linux we had about 3% improvement, on Sun about 4%, and on Mac OSX none. Note: all were x86 (Intel, AMD) machines, just different OS and GHz counts. When we go back to the "slow" (original) version:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 13474091 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

we get ((14341236-13474091)/14341236)*100 = 6% improvement. Cheers Zoran
Re: [naviserver-devel] Quest for malloc
Am 16.01.2007 um 10:46 schrieb Gustaf Neumann: This is most probably the best variant so far, and not complicated, so an optimizer can do "the right thing" easily. sorry for the many versions.. -gustaf

{
    unsigned register int s = (size-1) >> 3;
    while (s > 1) { s >>= 1; bucket++; }
}
if (bucket > NBUCKETS) { bucket = NBUCKETS; }

You'd be surprised. This one:

s = (size-1) >> 4;
while (s > 0xFF) { s = s >> 5; bucket += 5; }
while (s > 0x0F) { s = s >> 4; bucket += 4; }
...

gives:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 10098495 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

whereas this one:

s = (size-1) >> 3; while (s>1) { s >>= 1; bucket++; }

gives:

Test Tcl allocator with 4 threads, 16000 records ...
This allocator achieves 9720847 ops/sec under 4 threads
Press return to exit (observe the current memory footprint!)

That is ((10098495-9720847)/10098495)*100 ≈ 3.7% less. That is all measured on Linux. I haven't done it on the Mac and on the Sun yet. I now have all versions inside and will play a little on each platform to see which one operates best overall. The latest one is more appealing because of the simplicity of the code, so we can turn a blind eye to those few percent, I guess. Cheers Zoran
Re: [naviserver-devel] Quest for malloc
This is most probably the best variant so far, and not complicated, so an optimizer can do "the right thing" easily. sorry for the many versions.. -gustaf

{
    unsigned register int s = (size-1) >> 3;
    while (s > 1) { s >>= 1; bucket++; }
}
if (bucket > NBUCKETS) { bucket = NBUCKETS; }
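[Editor's note] To make the mapping concrete, here is the same computation written as a small standalone helper with a worked example. The function name and the assumption that bucket 0 holds 16-byte blocks (doubling per bucket) are mine, not taken from the allocator sources.

/* Sketch: the shift loop above as a helper.  For size = 1000:
 * s = 999 >> 3 = 124, which is halved 6 times before reaching 1,
 * so bucket = 6, i.e. the 16 << 6 = 1024-byte bucket. */
static int
BucketForSize(unsigned int size, int nbuckets)
{
    int bucket = 0;
    unsigned int s = (size - 1) >> 3;

    while (s > 1) {
        s >>= 1;
        bucket++;
    }
    if (bucket > nbuckets) {
        bucket = nbuckets;
    }
    return bucket;
}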
Re: [naviserver-devel] Quest for malloc
On 1/16/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: Am 15.01.2007 um 22:37 schrieb Zoran Vasiljevic: > > Am 15.01.2007 um 22:22 schrieb Mike: > >> >> Zoran, I believe you misunderstood. The "patch" above limits blocks >> allocated by your tester to 16000 instead of 16384 blocks. The >> reason >> for this is that Zippy's "largest bucket" is configured to be >> 16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a typo). >> By making uniformly random requests sizes up to 16_3_84, you are >> causing Zippy to fall back to system malloc for a small fraction of >> requests, substantially penalizing its performance in these cases. > > Ah! That's right. I will fix that. > >> >> You wanted to know why Zippy is slower on your test, this is the >> reason. This has substantial impact on FreeBSD and linux, and my >> guess is that it will have a drammatic effect on Mac OSX. > > I will check that tomorrow on my machines. YES. That did the trick. We have now demystified the behaviour on the Mac. Indeed, when I limit the max alloc size to below *16284* bytes, Zippy runs almosts as fast as VT alloc. So, it was my overlooking of the fact that it was 16284 and not 16K (16384) !! I wanted to give Zippy a fair chance but I missed that for about 100 bytes. Which made huge difference. Still, it shows again one of the weaknesses of Zippy: dependence of (potentially suboptimal) system memory allocator. But that is not to blame on zippy, rather on weak system malloc, as on the Mac. I guess same could have happened to us with a slow mmap()/munmap()... > >>> >>> How about adding this into the code? >> >> I think the most obvious replacement is just using an if "tree": >> if (size>0xff) bucket+=8, size&=0xff; >> if (size>0xf) bucket+=4, size&0xf; >> ... >> it takes a minute to get the math right, but the performance gain >> should be substantial. > > Well, I can test that allright. I have the feeling that a tight > loop as that (will mostly sping 5-12 times) gets well compiled > in machine code, but it is better to test. Allright. Gustaf came with this, and it saves about 10% of time: #if 0 while (bucket> 4; while (s > 0xFF) { s = s >> 5; bucket += 5; } while (s > 0x0F) { s = s >> 4; bucket += 4; } while (s > 0x08) { s = s >> 3; bucket += 3; } while (s > 0x04) { s = s >> 2; bucket += 2; } while (s > 0x00) { s = s >> 1; bucket++; } I will leave the above loop in the code and provide ifdef, as by looking at the below it is hard to understand what is really happening. But it works and it works fine. Cheers Zoran Can you import this into CVS? Top level.
Re: [naviserver-devel] Quest for malloc
Am 15.01.2007 um 22:37 schrieb Zoran Vasiljevic: Am 15.01.2007 um 22:22 schrieb Mike: Zoran, I believe you misunderstood. The "patch" above limits blocks allocated by your tester to 16000 instead of 16384 bytes. The reason for this is that Zippy's "largest bucket" is configured to be 16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a typo). By making uniformly random request sizes up to 16_3_84, you are causing Zippy to fall back to system malloc for a small fraction of requests, substantially penalizing its performance in these cases.

Ah! That's right. I will fix that.

You wanted to know why Zippy is slower on your test, this is the reason. This has substantial impact on FreeBSD and Linux, and my guess is that it will have a dramatic effect on Mac OSX.

I will check that tomorrow on my machines.

YES. That did the trick. We have now demystified the behaviour on the Mac. Indeed, when I limit the max alloc size to below *16284* bytes, Zippy runs almost as fast as VT alloc. So, it was my overlooking of the fact that it was 16284 and not 16K (16384)!! I wanted to give Zippy a fair chance but I missed that by about 100 bytes. Which made a huge difference. Still, it shows again one of the weaknesses of Zippy: dependence on a (potentially suboptimal) system memory allocator. But that is not to blame on Zippy, rather on a weak system malloc, as on the Mac. I guess the same could have happened to us with a slow mmap()/munmap()...

How about adding this into the code?

I think the most obvious replacement is just using an if "tree": if (size>0xff) bucket+=8, size&=0xff; if (size>0xf) bucket+=4, size&=0xf; ... it takes a minute to get the math right, but the performance gain should be substantial.

Well, I can test that all right. I have the feeling that a tight loop like that (it will mostly spin 5-12 times) compiles to good machine code, but it is better to test.

All right. Gustaf came up with this, and it saves about 10% of the time:

#if 0
    while (bucket < NBUCKETS && globalCache.sizes[bucket].blocksize < size) {
        ++bucket;
    }
#else
    s = (size-1) >> 4;
    while (s > 0xFF) { s = s >> 5; bucket += 5; }
    while (s > 0x0F) { s = s >> 4; bucket += 4; }
    while (s > 0x08) { s = s >> 3; bucket += 3; }
    while (s > 0x04) { s = s >> 2; bucket += 2; }
    while (s > 0x00) { s = s >> 1; bucket++; }
#endif

I will leave the original loop in the code and provide an #ifdef, as by looking at the shift version alone it is hard to understand what is really happening. But it works and it works fine. Cheers Zoran
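[Editor's note] For anyone who wants to convince themselves that the shift-based computation really matches the original table walk, the small standalone check below compares the two over the benchmark's size range (1..16000). The blocksize table and NBUCKETS value are reconstructed from the figures quoted later in this thread (16 bytes up to 16284); the function names and the test scaffolding are mine.

#include <stdio.h>

#define NBUCKETS 11

/* Block sizes as quoted in the thread: powers of two from 16, last one 16284. */
static const unsigned int blocksize[NBUCKETS] = {
    16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16284
};

/* Original O(n) table walk. */
static int BucketByTable(unsigned int size)
{
    int bucket = 0;
    while (bucket < NBUCKETS && blocksize[bucket] < size) {
        ++bucket;
    }
    return bucket;
}

/* Shift-based variant discussed in the thread. */
static int BucketByShift(unsigned int size)
{
    int bucket = 0;
    unsigned int s = (size - 1) >> 3;
    while (s > 1) {
        s >>= 1;
        bucket++;
    }
    return bucket;
}

int main(void)
{
    unsigned int size;
    for (size = 1; size <= 16000; size++) {
        if (BucketByTable(size) != BucketByShift(size)) {
            printf("mismatch at size %u: table=%d shift=%d\n",
                   size, BucketByTable(size), BucketByShift(size));
            return 1;
        }
    }
    printf("table walk and shift loop agree for sizes 1..16000\n");
    return 0;
}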
Re: [naviserver-devel] Quest for malloc
Am 15.01.2007 um 20:15 schrieb Stephen Deasey: Nobody yet gave any reasonable explanation why we are that fast on Mac OSX compared to any other allocator. Recall, that was 870.573/70.713.324 ops/sec Zippy/VT so about 81 times faster, for 16 threads. Although it really seems like a bug either in the testcode or in the allocator, I have not been able to verify any. All is working as it should. So, the mistery remains... Because Mac OSX SucksMonkeyBawlz() in a tight inner loop? Actually, Mike was right. The test pattern maxed the size to slightly above 16000 which turned Zippy back to system allocator and that alone screwed everything. When I limit the test program to allocate up to 16000 bytes but not more the performance of Zippy and VT are almost equal. So, the only thing that remains is the memory handling. But, as I stressed many times, our goal was to be +/- 25% to zippy performance with better memory handling (releasing memory to OS when possible). I still believe that we achieved our goal very well. But it is good to know why the difference on the Mac was so much higher then elsewhere. I guess if I repeat the Zippy/VT speed comparison on other platform, with 16000 bytes upper limit that performance difference will be little or none. Many thanks to Mike for good observation! Cheers Zoran
Re: [naviserver-devel] Quest for malloc
Am 15.01.2007 um 22:22 schrieb Mike: Zoran, I believe you misunderstood. The "patch" above limits blocks allocated by your tester to 16000 instead of 16384 bytes. The reason for this is that Zippy's "largest bucket" is configured to be 16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a typo). By making uniformly random request sizes up to 16_3_84, you are causing Zippy to fall back to system malloc for a small fraction of requests, substantially penalizing its performance in these cases.

Ah! That's right. I will fix that.

You wanted to know why Zippy is slower on your test, this is the reason. This has substantial impact on FreeBSD and Linux, and my guess is that it will have a dramatic effect on Mac OSX.

I will check that tomorrow on my machines.

The benefit of mmap() is being able to "for sure" release memory back to the system. The drawback is that it always incurs a substantial syscall overhead compared to malloc. You decide which you prefer (I think I would lean slightly toward mmap() for long-lived applications, but not by much, since the syscall introduces a lot of variance and an average performance degradation).

Yep. I agree. I would avoid it if possible. But I know of no other sure memory-returning call! I see that most (all?) of the allocators I know just keep everything allocated and never return it.

How about adding this into the code?

I think the most obvious replacement is just using an if "tree": if (size>0xff) bucket+=8, size&=0xff; if (size>0xf) bucket+=4, size&=0xf; ... it takes a minute to get the math right, but the performance gain should be substantial.

Well, I can test that all right. I have the feeling that a tight loop like that (it will mostly spin 5-12 times) compiles to good machine code, but it is better to test.

In my tests, due to the frequency of calls of these functions, they contribute 10% to 15% performance overhead.

Yes. That is what I was also getting. OTOH, the speed difference between VT and Zippy was sometimes several orders of magnitude, so I simply ignored that.

Ha! It is pretty simple: you can atomically check pointer equivalence without risking a core (at least this is my experience). You are not expected to make far-reaching decisions based on it, though. In this particular example, even if the test was false, there would be no "harm" done, just a suboptimal path would be selected. I have marked that "Dirty read" to draw people's attention to that place. And I succeeded, obviously :-)

The dirty read I have no problem with. It's the possibility of taking the head element, which could be placed there by another thread, that bothers me.

Ah, this will not happen, as I take the global mutex at that point, so the pagePtr->p_cachePtr cannot be changed under our feet. If that block was allocated by the current thread, the p_cachePtr will not be changed by anybody. So no harm. If it is not, then we must lock the global mutex to prevent anybody fiddling with that element. It is tricky but it should work.

It sounds like you are in the best position to test this change to see if it fixes the "unbounded" growth problem.

Yes! Indeed. The only thing I'd have to check is how much more memory this will take. But it is certainly worth trying out, as it will be a temporary relief to our users until we stress test the VT to the max so I can include it in our standard distro.
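[Editor's note] Since the exchange above hinges on mmap()/munmap() being the one call that reliably hands memory back to the OS, here is a minimal, self-contained illustration of that path. The 32K page size matches the figure mentioned later in the thread; everything else (names, flag choice) is just common POSIX usage and is not taken from the vtmalloc sources.

#include <stdio.h>
#include <sys/mman.h>

#define PAGE_BYTES (32 * 1024)   /* vtmalloc-style 32K page, per the thread */

int main(void)
{
    /* Grab one anonymous, private page directly from the kernel. */
    void *page = mmap(NULL, PAGE_BYTES, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
    if (page == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... carve the page into blocks and hand them out ... */

    /* Unlike free(), munmap() really returns the pages to the OS,
     * which is why the resident size can drop when threads exit. */
    if (munmap(page, PAGE_BYTES) != 0) {
        perror("munmap");
        return 1;
    }
    return 0;
}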
Re: [naviserver-devel] Quest for malloc
> a) > The test program Zoran includes biases Zippy toward "standard" > allocator, which it does not do for VT. The following patch > "corrects" this behavior: > > +++ memtest.c Sun Jan 14 16:43:23 2007 > @@ -211,6 +211,7 @@ > } else { > size &= 0x3FFF; /* Limit to 16K */ > } > + if (size>16000) > size = 16000; > *toallocptr++ = size; > } > } > First of all, I wanted to give Zippy a fair chance. If I increase the max allocation size, Zippy becomes even more slow than it is. And, Zippy handles 16K pages, whereas we handle 32K pages. Hence the size &= 0x3FFF; /* Limit to 16K */ which limits the allocation size to 16K max. To increase that would even more hit Zippy than us. Zoran, I believe you misunderstood. The "patch" above limits blocks allocated by your tester to 16000 instead of 16384 blocks. The reason for this is that Zippy's "largest bucket" is configured to be 16284-sizeof(Block) bytes (note the "2" in 16_2_84 is _NOT_ a typo). By making uniformly random requests sizes up to 16_3_84, you are causing Zippy to fall back to system malloc for a small fraction of requests, substantially penalizing its performance in these cases. > The following patch allows Zippy to be a lot less aggressive in > putting blocks into the shared pool, bringing the performance of Zippy > much closer to VT, at the expense of substantially higher memory > "waste": > > @@ -128,12 +174,12 @@ > { 64, 256, 128, NULL}, > { 128, 128, 64, NULL}, > { 256, 64, 32, NULL}, > -{ 512, 32, 16, NULL}, > -{ 1024, 16, 8, NULL}, > -{ 2048,8, 4, NULL}, > -{ 4096,4, 2, NULL}, > -{ 8192,2, 1, NULL}, > -{16284,1, 1, NULL}, > +{ 512, 64, 32, NULL}, > +{ 1024, 64, 32, NULL}, > +{ 2048, 64, 32, NULL}, > +{ 4096, 64, 32, NULL}, > +{ 8192, 64, 32, NULL}, > +{16284, 64, 32, NULL}, > I cannot comment on that. Possibly you are right but I do not see much benefit of that except speeding up Zippy to be on pair with VT, whereas most important VT feature is not the speed, it is the memory handling. You wanted to know why Zippy is slower on your test, this is the reason. This has substantial impact on FreeBSD and linux, and my guess is that it will have a drammatic effect on Mac OSX. > VT releases the memory held in a thread's > local pool when a thread terminates. Since it uses mmap by default, > this means that de-allocated storage is actually released to the > operating system, forcing new threads to call mmap() again to get > memory, thereby incurring system call overhead that could be avoided > in some cases if the system malloc implementation did not lower the > sbrk point at each deallocation. Using malloc() in VT allocator > should give it much more uniform and consisent performance. Not necessarily. We'd shoot ourselves in the foot by doing so, because most OS allocators never return memory to the system and one of our major benefits will be gone. What we could do: timestamp each page, return all pages to the global cache and prune older. Or, put a size constraint on the global cache. But then you'd have yet-another-knob to adjust and the difficulty would be to find the right setup. VT is more simple in that as it does not offer you ANY knobs you can trim (for better or for worse). In some early stages of the design we had number of knobs and were not certain how to adjust them. So we threw that away and redesigned all parts to be "self adjusting" if possible. The benefit of mmap() is being able to "for sure" release memory back to the system. The drawback is that it always incurrs a substantial syscall overhead compared to malloc. 
You decide which you prefer (I think I would lean slightly toward mmap() for long-lived applications, but not by much, since the syscall introduces a lot of variance and an average performance degradation).

> e)
> Both allocators use an O(n) algorithm to compute the power of two
> "bucket" for the allocated size. This is just plain silly since an
> O(log n) algorithm will offer non-negligible speed up in both
> allocators. This is the current O(n) code:
>
> while (bucket < NBUCKETS && globalCache.sizes[bucket].blocksize < size) {
>     ++bucket;
> }
>
> How about adding this into the code?

I think the most obvious replacement is just using an if "tree": if (size>0xff) bucket+=8, size&=0xff; if (size>0xf) bucket+=4, size&=0xf; ... it takes a minute to get the math right, but the performance gain should be substantial.

> f)
> Zippy uses Ptr2Block and Block2Ptr functions whereas VT uses macros
> for this. Zippy also does more checks on MAGIC numbers on each
> allocation, which VT only performs on de-allocation. I am not sure if
> current compilers are smart enough to inline the functions in Zippy, I
> did not test this. When compiled with
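[Editor's note] Mike's if-"tree" sketch above takes a little care to get right; below is one way it could look that reproduces the result of the "(size-1) >> 3" shift loop quoted elsewhere in this thread for the sizes the benchmark uses (up to 16K). The function name is mine, and this is an illustration of the idea, not code from either allocator; the NBUCKETS clamp from the original code would still be applied by the caller.

/* Branch "tree" replacement for the O(n) bucket walk: effectively
 * floor(log2((size - 1) >> 3)), computed with at most four tests.
 * Valid for 1 <= size <= 16384; larger sizes would need one more step. */
static int
SizeToBucketTree(unsigned int size)
{
    unsigned int s = (size - 1) >> 3;   /* bucket 0 holds 16-byte blocks */
    int bucket = 0;

    if (s >> 8) { s >>= 8; bucket += 8; }
    if (s >> 4) { s >>= 4; bucket += 4; }
    if (s >> 2) { s >>= 2; bucket += 2; }
    if (s >> 1) {          bucket += 1; }

    return bucket;
}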
Re: [naviserver-devel] Quest for malloc
On 1/15/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: Nobody yet gave any reasonable explanation why we are that fast on Mac OSX compared to any other allocator. Recall, that was 870.573/70.713.324 ops/sec Zippy/VT so about 81 times faster, for 16 threads. Although it really seems like a bug either in the testcode or in the allocator, I have not been able to verify any. All is working as it should. So, the mistery remains... Because Mac OSX SucksMonkeyBawlz() in a tight inner loop? The engineers at Apple have many fine achievements, but this kind of system level performance isn't one of them. All the benchmarks I've ever seen show that SucksMonkeyBawlz() is sprinkled throughout the code responsible for locking, context switching, memory allocation, etc. So, don't be surprised. Enjoy the drop shadows!
Re: [naviserver-devel] Quest for malloc
I've been running the new allocator for several weeks now on a busy NaviServer; memory does not grow anymore, once threads exit it is returned, and no crashes have been observed.

Zoran Vasiljevic wrote: Am 19.12.2006 um 20:42 schrieb Stephen Deasey: Zoran will be happy... :-) Zoran is again happy to put the next small update of the (famous) VT malloc on: http://www.archiware.com/downloads/vtmalloc-0.0.2.tar.gz For the list of changes since 0.0.1, please look in the ChangeLog file. As it seems, we are still pretty fast and, thanks to Mike, we know why Zippy is that slow when exposed to our memtest program. Nobody has yet given any reasonable explanation why we are that fast on Mac OSX compared to any other allocator. Recall, that was 870.573/70.713.324 ops/sec Zippy/VT, so about 81 times faster, for 16 threads. Although it really seems like a bug either in the test code or in the allocator, I have not been able to verify any. All is working as it should. So, the mystery remains... Cheers Zoran

-- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/
Re: [naviserver-devel] Quest for malloc
Am 19.12.2006 um 20:42 schrieb Stephen Deasey: Zoran will be happy... :-) Zoran is again happy to put the next small update of the (famous) VT malloc on: http://www.archiware.com/downloads/vtmalloc-0.0.2.tar.gz For the list of changes since 0.0.1, please look in the ChangeLog file. As it seems, we are still pretty fast and, thanks to Mike, we know why Zippy is that slow when exposed to our memtest program. Nobody has yet given any reasonable explanation why we are that fast on Mac OSX compared to any other allocator. Recall, that was 870.573/70.713.324 ops/sec Zippy/VT, so about 81 times faster, for 16 threads. Although it really seems like a bug either in the test code or in the allocator, I have not been able to verify any. All is working as it should. So, the mystery remains... Cheers Zoran
Re: [naviserver-devel] Quest for malloc
Am 15.01.2007 um 10:27 schrieb Mike: Although not entirely sure why, I have spent some time analyzing the behavior and code of both of these allocators. Well, I'd say it is simple "why": the results I have presented are just too "tempting" so you wanted really to know *why*. This is normal. I'd do the same. a) The test program Zoran includes biases Zippy toward "standard" allocator, which it does not do for VT. The following patch "corrects" this behavior: +++ memtest.c Sun Jan 14 16:43:23 2007 @@ -211,6 +211,7 @@ } else { size &= 0x3FFF; /* Limit to 16K */ } + if (size>16000) size = 16000; *toallocptr++ = size; } } First of all, I wanted to give Zippy a fair chance. If I increase the max allocation size, Zippy becomes even more slow than it is. And, Zippy handles 16K pages, whereas we handle 32K pages. Hence the size &= 0x3FFF; /* Limit to 16K */ which limits the allocation size to 16K max. To increase that would even more hit Zippy than us. b) The key difference between Zippy and VT allocators arises form their use of the shared "freed" memory pool. Zippy calls this the shared cache, VT calls this the global cache. Zippy's goal appears to have been to minimize memory usage (while the stated goal is to reduce lock contention). Zippy does this by aggressively moving freed blocks to the shared cache, allowing any thread to later allocate memory from this shared pool. Meanwhile VT targets speed, trading off bloat, and allowing freed blocks to return to the private per-thread pools. To allow for this speed optimization, VT keeps a pointer to the cache that allocated it within each "page", something that can be done for Zippy if speed was the goal. Hmhm... Still, our intention is to be more conservative in *overall* memory usage. That means, I'm prepared to give myself more memory if I can be faster with that *temporarily* (after all modern systems have huge memory banks) but I would not like to be greedy and keep that memory for myself all the time. Which is precisely what VT does: it is more memory hungry in terms of temporarily allocated memory (although not that significant for this to be a problem) but it is social-enough to release that when not needed any more. c) The key reason why Zippy substantially lags behind VT in performance is actually because Zippy beats itself at its own game. While it's stated goal is to minimize lock contention, the hardcoded constants used in Zippy actually completely sacrifice lock contention for storage. Naturally, thread-local pools can be used to allocate blocks immediately, while the shared pool must be locked by a mutex when allocation is performed. The current Zippy configuration minimizes the amount of storage "wasted" in per-thread pools by aggressively moving larger blocks to the shared cache. The more threads attempt to allocate/free large blocks, the worse the contention and the lower the performance. Zoran's test program produces allocation sizes that are uniform random, so large blocks are equally likely to small blocks, therefore performance suffers substantially. A more accurate benchmark would take common usage patterns from Tcl/NaviServer, which I suspect are heavily biased toward allocation of small objects. If you can modify memtest.c to be like that I'd have nothing against! Actually, we have no problems with small allocations nor with large ones as they are all handled by the same mechanism. In Zippy large allocations (over 16K) are just handled with the system malloc with all trade-offs that this brings. 
The following patch allows Zippy to be a lot less aggressive in putting blocks into the shared pool, bringing the performance of Zippy much closer to VT, at the expense of substantially higher memory "waste": @@ -128,12 +174,12 @@ { 64, 256, 128, NULL}, { 128, 128, 64, NULL}, { 256, 64, 32, NULL}, -{ 512, 32, 16, NULL}, -{ 1024, 16, 8, NULL}, -{ 2048,8, 4, NULL}, -{ 4096,4, 2, NULL}, -{ 8192,2, 1, NULL}, -{16284,1, 1, NULL}, +{ 512, 64, 32, NULL}, +{ 1024, 64, 32, NULL}, +{ 2048, 64, 32, NULL}, +{ 4096, 64, 32, NULL}, +{ 8192, 64, 32, NULL}, +{16284, 64, 32, NULL}, I cannot comment on that. Possibly you are right but I do not see much benefit of that except speeding up Zippy to be on pair with VT, whereas most important VT feature is not the speed, it is the memory handling. d) VT uses mmap by default to allocate memory, Zippy uses the system malloc. By doing this, VT actually penalizes itself in an environment where lots of small blocks are frequently allocated and threads are often created/destroyed. Partly right. Lots of small blocks is no problem. We allocate 32K pages that yields 2048 16-byte blocks, 1024 32-byte blocks etc. So, small allocations are
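[Editor's note] A quick back-of-the-envelope check of the page arithmetic mentioned just above. The 32K page size comes from the thread; the assumption that a whole page is carved into equal blocks (ignoring per-page and per-block headers) is a simplification for illustration, and the program itself is mine.

#include <stdio.h>

/* Print how many equal-sized blocks fit in one 32K page for each
 * power-of-2 block size: 2048 x 16 bytes, 1024 x 32 bytes, ... */
int main(void)
{
    const unsigned int pageSize = 32 * 1024;
    unsigned int blockSize;

    for (blockSize = 16; blockSize <= 16384; blockSize <<= 1) {
        printf("%5u-byte blocks: %4u per page\n",
               blockSize, pageSize / blockSize);
    }
    return 0;
}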
Re: [naviserver-devel] Quest for malloc
Vlad has written an allocator that uses mmap to obtain memory for the system and munmap that memory on thread exit, if possible. I have spent more than 3 weeks fiddling with that and discussing it with Vlad and this is what we bith come to: http://www.archiware.com/downloads/vtmalloc-0.0.1.tar.gz I believe we have solved most of my needs. Below is an excerpt from the README file for the qurious. If anybody would care to test it in his/her own environment? If all goes well, I might TIP this to be included in Tcl core as replacement of (or addition to) the zippy allocator. Although not entirely sure why, I have spent some time analyzing the behavior and code of both of these allocators. Since I don't really want to spend too much more, the following comments are not organized in any particulr order of importance or relevance... a) The test program Zoran includes biases Zippy toward "standard" allocator, which it does not do for VT. The following patch "corrects" this behavior: +++ memtest.c Sun Jan 14 16:43:23 2007 @@ -211,6 +211,7 @@ } else { size &= 0x3FFF; /* Limit to 16K */ } + if (size>16000) size = 16000; *toallocptr++ = size; } } b) The key difference between Zippy and VT allocators arises form their use of the shared "freed" memory pool. Zippy calls this the shared cache, VT calls this the global cache. Zippy's goal appears to have been to minimize memory usage (while the stated goal is to reduce lock contention). Zippy does this by aggressively moving freed blocks to the shared cache, allowing any thread to later allocate memory from this shared pool. Meanwhile VT targets speed, trading off bloat, and allowing freed blocks to return to the private per-thread pools. To allow for this speed optimization, VT keeps a pointer to the cache that allocated it within each "page", something that can be done for Zippy if speed was the goal. c) The key reason why Zippy substantially lags behind VT in performance is actually because Zippy beats itself at its own game. While it's stated goal is to minimize lock contention, the hardcoded constants used in Zippy actually completely sacrifice lock contention for storage. Naturally, thread-local pools can be used to allocate blocks immediately, while the shared pool must be locked by a mutex when allocation is performed. The current Zippy configuration minimizes the amount of storage "wasted" in per-thread pools by aggressively moving larger blocks to the shared cache. The more threads attempt to allocate/free large blocks, the worse the contention and the lower the performance. Zoran's test program produces allocation sizes that are uniform random, so large blocks are equally likely to small blocks, therefore performance suffers substantially. A more accurate benchmark would take common usage patterns from Tcl/NaviServer, which I suspect are heavily biased toward allocation of small objects. The following patch allows Zippy to be a lot less aggressive in putting blocks into the shared pool, bringing the performance of Zippy much closer to VT, at the expense of substantially higher memory "waste": @@ -128,12 +174,12 @@ { 64, 256, 128, NULL}, { 128, 128, 64, NULL}, { 256, 64, 32, NULL}, -{ 512, 32, 16, NULL}, -{ 1024, 16, 8, NULL}, -{ 2048,8, 4, NULL}, -{ 4096,4, 2, NULL}, -{ 8192,2, 1, NULL}, -{16284,1, 1, NULL}, +{ 512, 64, 32, NULL}, +{ 1024, 64, 32, NULL}, +{ 2048, 64, 32, NULL}, +{ 4096, 64, 32, NULL}, +{ 8192, 64, 32, NULL}, +{16284, 64, 32, NULL}, d) VT uses mmap by default to allocate memory, Zippy uses the system malloc. 
By doing this, VT actually penalizes itself in an environment where lots of small blocks are frequently allocated and threads are often created/destroyed. VT releases the memory held in a thread's local pool when a thread terminates. Since it uses mmap by default, this means that de-allocated storage is actually released to the operating system, forcing new threads to call mmap() again to get memory, thereby incurring system call overhead that could be avoided in some cases if the system malloc implementation did not lower the sbrk point at each deallocation. Using malloc() in the VT allocator should give it much more uniform and consistent performance. Using mmap() in Zippy has less performance impact since memory is never released by Zippy (at thread termination it is just placed back into the shared pool). Another obvious downside of using mmap() for Zippy is that realloc() must always fall back to the slow allocate/copy/free mechanism and can never be optimized. e) Both allocators use an O(n) algorithm to compute the power of two "bucket" for the allocated size. This is just plain silly since an O(log n) algorithm will offer non-negligible speed up in both allocators. This is
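[Editor's note] Point f) later in this message contrasts Zippy's Ptr2Block()/Block2Ptr() functions with VT's macros. For readers who have not looked at either source, here is a hedged sketch of what the macro style looks like; the Block layout, the MAGIC value, and all field names are invented for illustration and do not match the actual zippy or vtmalloc structures.

#include <stddef.h>

/* Illustrative block header: the user pointer sits right after it. */
typedef struct Block {
    unsigned char magic1;    /* overwrite checks ("MAGIC numbers")   */
    unsigned char bucket;    /* which size class the block came from */
    unsigned char magic2;
    unsigned char unused;
    size_t        reqSize;   /* size originally requested            */
} Block;

#define BLOCK_MAGIC 0xEF

/* Stamp the header and hand back the usable region (macro, no call overhead). */
#define Block2Ptr(blockPtr, b, size) \
    ((blockPtr)->magic1 = (blockPtr)->magic2 = BLOCK_MAGIC, \
     (blockPtr)->bucket = (unsigned char)(b), \
     (blockPtr)->reqSize = (size), \
     (void *)((blockPtr) + 1))

/* Recover the header from a user pointer, e.g. in free()/realloc(). */
#define Ptr2Block(ptr) (((Block *)(ptr)) - 1)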
Re: [naviserver-devel] Quest for malloc
Am 13.01.2007 um 10:45 schrieb Gustaf Neumann: The fault was that I did not read the README (I read the first one) and compiled (a) without -DTCL_THREADS.

In that case, the fault was that on FreeBSD you need to explicitly pass "-pthread" when linking the test program, regardless of the fact that libtcl8.4.so was already linked with it. That, and only that, did the trick. Speed was (as expected, and still not clear why) at least 2 times better than anything else. In some rough cases it was _significantly_ faster.

But... I believe we should not fixate on the speed of the allocator. It was not our intention to make something faster. Our intention was to release memory early enough so we don't bloat the system as a long-running process. I admit, speed of the code is always the most interesting and tempting issue for engineers, but in this case it was really the memory savings for long-running programs that we were after.

Having said that, I must again repeat that we'd like to get some field experience with the allocator before we take any further steps. This means that we are thankful for any feedback. Cheers, zoran
Re: [naviserver-devel] Quest for malloc
On 1/13/07, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: Am 13.01.2007 um 06:17 schrieb Mike: > I'm happy to offer ssh access to a test > box where you can reproduce these results. Oh, that is very fine! Can you give me the access data? You can post me the login-details in a separate private mail. Zoran, Tried to contact you, but did not receive reply. Check your spam filter/email me.
Re: [naviserver-devel] Quest for malloc
Am 13.01.2007 um 10:45 schrieb Gustaf Neumann: PPS: strangly, the only think making me supicious is the huge amount of improvement, especially on Mac OS X. Look... Running the test program unmodified (on Mac Pro box): Test Tcl allocator with 4 threads, 16000 records ... This allocator achieves 35096360 ops/sec under 4 threads Press return to exit (observe the current memory footprint!) If I modify the memtest.c program at line 146 to read: if (dorealloc && (allocptr > tdata[tid].allocs) && (r & 1)) { allocptr[-1] = reallocs[whichmalloc](allocptr[-1], *toallocptr); } else { allocptr[0] = mallocs[whichmalloc](*toallocptr); /*-->*/ memset(allocptr[0], 0, *toallocptr > 64 ? 64 : *toallocptr); allocptr++; } Test Tcl allocator with 4 threads, 16000 records ... This allocator achieves 28377808 ops/sec under 4 threads Press return to exit (observe the current memory footprint!) If I memset the whole memory area, not just first 64 bytes: Test Tcl allocator with 4 threads, 16000 records ... This allocator achieves 14862477 ops/sec under 4 threads Press return to exit (observe the current memory footprint!) BUT, guess what! The system allocator gives me (using same test data i.e. memsetting the whole allocated chunk): Test standard allocator with 4 threads, 16000 records ... This allocator achieves 869716 ops/sec under 4 threads Press return to exit (observe the current memory footprint!) So we are still: 14862477/869716 = 17 times faster. With increasing thread count we get faster and faster whereas system allocator stays at the same (low) level or is getting slower. Now, I would really like to know why! Perhaps the fact that we are using mmap() instead of god-knows-what Apple is using... Anyways... either we have some very big error there (in which case I'd like to know where, as everything is working as it should!) or we have found much better way to handle memory on Mac OSX :-) Cheers Zoran
Re: [naviserver-devel] Quest for malloc
Am 13.01.2007 um 10:45 schrieb Gustaf Neumann: correcting these configuration issues, the program works VERY well; I tried it on 32-bit and 64-bit machines (a minor complaint about the memtest program: casting a 32-bit int to ClientData and vice versa)

Where exactly, so I can fix that?

-gustaf PS: I could get access to a 64-bit AMD FreeBSD machine on Monday, if there is still need...

Well, I could use some hands-on experience... Please send me some login data so I can check it out!

PPS: strangely, the only thing making me suspicious is the huge amount of improvement, especially on Mac OS X. I can't remember in my experience having seen such a drastic performance increase from a relatively small code change, especially in an area which is usually carefully fine-tuned, and where many CS grads from all over the world are writing their theses on.

This *is* the only fact that's *really* puzzling me so much. I cannot explain it at all. I was also stepping through the whole thing with the debugger because I thought there must be some error somewhere, but I found none! Then I thought it is the test program that does something weird. But it's not! The test program just happily allocates random chunks between 16 and 16384 bytes and then releases them. No big magic. And certainly not rocket science. So the mystery remains...

Moreover, the Tcl allocator seems to suck greatly when exposed to such a test on Mac OSX. Also not very explainable. The only thing I noticed is: the Tcl allocator uses MUCH system time, whereas our alloc uses close to none. That would suggest that it does lots of locking (I cannot imagine it would do anything else system-related), whereas our alloc does close to no locking (that is, when the memory is allocated and freed in the same thread, which is what happens 99% of the time).

I would recommend that Vlad and Zoran should write a technical paper about the new allocator and analyze the properties and differences.

If I could ever get time for that! What we could write is a more in-depth explanation of how it works (it is actually very simple). Cheers Zoran
Re: [naviserver-devel] Quest for malloc
I downloaded the code in the previous mail. After some minor path adjustments, I was able to get the test program to compile and link under FreeBSD 6.1 running on a dual-processor PIII system, linked against a threaded tcl 8.5a. I could get this program to consistently do one of two things: - dump core - hang seemingly forever but absolutely nothing else.

Mike, when Zoran announced the version, I downloaded it and had similar experiences. Fault 1 turned out to be: Zoran's link led to a premature version of the software, not the real thing (the right version is untarred to a directory containing the version numbers). Then Zoran corrected the link, I refetched, and .. well, no makefile. Just compile and try: same effect. The fault was that I did not read the README (I read the first one) and compiled (a) without -DTCL_THREADS. I had exactly the same symptoms. After correcting these configuration issues, the program works VERY well; I tried it on 32-bit and 64-bit machines (a minor complaint about the memtest program: casting a 32-bit int to ClientData and vice versa) -gustaf

PS: I could get access to a 64-bit AMD FreeBSD machine on Monday, if there is still need...

PPS: strangely, the only thing making me suspicious is the huge amount of improvement, especially on Mac OS X. I can't remember in my experience having seen such a drastic performance increase from a relatively small code change, especially in an area which is usually carefully fine-tuned, and where many CS grads from all over the world are writing their theses on. I would recommend that Vlad and Zoran should write a technical paper about the new allocator and analyze the properties and differences.
Re: [naviserver-devel] Quest for malloc
Am 13.01.2007 um 06:17 schrieb Mike: Running this program under the latest version of valgrind (using the memcheck or helgrind tools) reveals numerous errors from valgrind, which I suspect (although I did not confirm) are the reason for the core dumps and infinite hangs when it is run on its own.

Even more interesting... I just gave it a Purify run on Solaris 2.8 with 4 threads and it revealed absolutely no problems or leaks. Heh? Can it be that the problem is not the alloc code but the tcl 8.5 alpha that you linked against? I never tested anything other than 8.4.14. Please be aware that I haven't touched the 8.5 tree up to now, so there could be some problems there, as there have been lots of changes in the Tcl head branch lately. To save your time and mine, access to your box where I can verify the odd behaviour you're reporting would be very helpful! Cheers Zoran
Re: [naviserver-devel] Quest for malloc
Am 13.01.2007 um 06:17 schrieb Mike: I'm happy to offer ssh access to a test box where you can reproduce these results. Oh, that is very fine! Can you give me the access data? You can post me the login-details in a separate private mail. Thanks, Zoran
Re: [naviserver-devel] Quest for malloc
Am 13.01.2007 um 06:17 schrieb Mike: I downloaded the code in the previous mail. After some minor path adjustments, I was able to get the test program to compile and link under FreeBSD 6.1 running on a dual-processor PIII system, linked against a threaded tcl 8.5a. I could get this program to consistently do one of two things: - dump core - hang seemingly forever but absolutely nothing else. Running this program under the latest version of valgrind (using memcheck or helgrind tools) reveals numerous errors from valgrind, which I suspect (although I did not confirm) are the reason for the core dumps and infinite hangs when it is run on its own. Hey, it is the first time *ever* it got to the public, so do not expect mission-critical bullet-proof code! No wonder there are still errors there, but those are to be fixed, of course. After all, at least two persons (myself and Vlad) are going to include this work in production system(s). So it needs much tests, of course Thank you for taking a look at it. If you'd like to help a bit... compile the Tcl with --enable-symbols and hit it again until it crashes. Then inspect the core with the debugger and give me the stack trace of the crashing thread. And, generally speaking... I would not spent time on that if that's avoidable. Show me a good memory conservative allocator that is fast enough and returns memory to the system and works accross Linux, Solaris, Mac OSX and Windows? To my knowledge, there is none. During all this (testing and developing) time, I found the Solaris alloc to be the most-appropriate, but still, this one also grabs all the memory it can and never releases it back! So, the question is not that I'd like some exercise in writing memory allocators. I don't. I have *plenty* of other work on my back. But we happen to have a product out there (already 1000+ installations worldwide) that needs a reboot each day because of the way it consumes system memory. Not leaks. Regular consumption. So I have a very pressing need to undertake something in this direction, if you understand what I mean. Now if you can get me some debug data from your box so I can check what is going on, that would be very nice! Cheers Zoran
Re: [naviserver-devel] Quest for malloc
I've been on a search for an allocator that will be fast enough and not so memory hungry as the allocator being built in Tcl. Unfortunately, as it mostly is, it turned out that I had to write my own. Vlad has written an allocator that uses mmap to obtain memory for the system and munmap that memory on thread exit, if possible. I have spent more than 3 weeks fiddling with that and discussing it with Vlad and this is what we bith come to: http://www.archiware.com/downloads/vtmalloc-0.0.1.tar.gz I believe we have solved most of my needs. Below is an excerpt from the README file for the qurious. If anybody would care to test it in his/her own environment? If all goes well, I might TIP this to be included in Tcl core as replacement of (or addition to) the zippy allocator. Zoran, Because I am quite biased here, to avoid later being branded as biased,I want to explicitly state my bias up front: In my experience, very little good comes out of people writing their own memory allocators. There is a small number of people in this world for who this privilege should be reserved (outside of a classroom excercise, of course), and the rest of us humble folk should help them when we can but generally stay out of the way - setting out to reinvent the wheel is not a good thing. I downloaded the code in the previous mail. After some minor path adjustments, I was able to get the test program to compile and link under FreeBSD 6.1 running on a dual-processor PIII system, linked against a threaded tcl 8.5a. I could get this program to consistently do one of two things: - dump core - hang seemingly forever but absolutely nothing else. Running this program under the latest version of valgrind (using memcheck or helgrind tools) reveals numerous errors from valgrind, which I suspect (although I did not confirm) are the reason for the core dumps and infinite hangs when it is run on its own. I have no time to debug this myself, however in the interest of science and general progress, I'm happy to offer ssh access to a test box where you can reproduce these results. I strongly advise against using a benchmark with the above characteristics to make any decisions about speed or memory consumption improvements or problems. --- After toying around with this briefly, I was able to run the test program under valgrind after specifying a -rec value of 1000 or less. Despite some errors reported by valgrind, the test program does run to completion and report its results in these cases. standard allocator: This allocator achieves 43982 ops/sec under 4 threads tcl allocator: This allocator achieves 21251 ops/sec under 4 threads improved tcl allocator: This allocator achieves 21308 ops/sec under 4 threads But again, I would not draw any serious conclusions from these numbers.
Re: [naviserver-devel] Quest for malloc
On 19.12.2006 at 20:42, Stephen Deasey wrote: On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: Right, with Ns_ functions it does not crash. Zoran will be happy... :-)

In fact, yes! I'm more than happy to announce something that will change the way we use computers in the 22nd century (if we live long enough to witness it) :-) Seriously... I've been on a search for an allocator that will be fast enough and not as memory hungry as the allocator built into Tcl. Unfortunately, as it mostly goes, it turned out that I had to write my own. Vlad has written an allocator that uses mmap to obtain memory from the system and munmap that memory on thread exit, if possible. I have spent more than 3 weeks fiddling with that and discussing it with Vlad, and this is what we both came up with: http://www.archiware.com/downloads/vtmalloc-0.0.1.tar.gz I believe it solves most of my needs. Below is an excerpt from the README file for the curious. Would anybody care to test it in his/her own environment? If all goes well, I might TIP this to be included in the Tcl core as a replacement for (or addition to) the zippy allocator.

- Compared was the performance of the OS memory allocator (Standard), the Tcl built-in threading allocator (Zippy) and this (VT) allocator. The first table shows alloc/free operations on 16000 blocks of memory, each of random size between 16 and 16384 bytes. The total number of blocks is divided among the threads, so 1 thread operates on 16000 blocks, 2 threads each on 8000, 4 threads each on 4000 blocks, etc. For each test, the program was run three times and the best value was taken. Speed numbers are in operations/second. More is better. The second table shows memory usage. Values are gathered by peeking at the system "top" utility. "Top" is the peak memory during the program run. "Low" is just before the program exits. Memory usage numbers are (rounded) in MB. Less is better.

Machine: Apple Mac Pro, 2 x Intel Core Duo 2.66GHz, 1GB, Mac OSX 10.4.8

| Allocator | 1 thread   | 2 threads  | 4 threads  | 8 threads  | 16 threads |
+-----------+------------+------------+------------+------------+------------+
| Standard  |  2.316.454 |  2.187.852 |  2.103.777 |  2.108.825 |  2.304.939 |
| Zippy     |  7.111.380 |  3.214.132 |  1.450.300 |    851.347 |    870.573 |
| VT        | 25.047.968 | 25.438.877 | 30.615.718 | 48.845.898 | 70.713.324 |

|           |        Top          |        Low          |
| Allocator | Resident | Virtual  | Resident | Virtual  |
+-----------+----------+----------+----------+----------+
| Standard  |       49 |      125 |       49 |      112 |
| Zippy     |      102 |      182 |      102 |      182 |
| VT        |       43 |      169 |        1 |       50 |

Machine: Sun Ultra 20, 1 x AMD 2.6GHz, 2GB, Solaris 10

| Allocator | 1 thread   | 2 threads  | 4 threads  | 8 threads  | 16 threads |
+-----------+------------+------------+------------+------------+------------+
| Standard  |  7.725.757 |  7.940.706 |  8.661.384 |  9.673.767 | 11.348.060 |
| Zippy     |  9.375.668 |  9.638.397 | 10.044.609 | 10.121.013 | 10.126.495 |
| VT        | 13.539.585 | 14.018.716 | 14.058.184 | 14.287.382 | 15.206.398 |

|           |        Top          |        Low          |
| Allocator | Resident | Virtual  | Resident | Virtual  |
+-----------+----------+----------+----------+----------+
| Standard  |       67 |       97 |       67 |       97 |
| Zippy     |      128 |      153 |      128 |      153 |
| VT        |       44 |      137 |        2 |       19 |

Machine: AMD Athlon XP2200, 1.8GHz, 512MB, Linux Suse9.1

| Allocator | 1 thread   | 2 threads  | 4 threads  | 8 threads  | 16 threads |
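For readers who want a feel for what the README's numbers actually measure, here is a small, hypothetical sketch of such a benchmark loop. It is not the vtmalloc test program; the names, the round count, and the pthread-based timing are assumptions. Each thread allocates and frees its share of blocks of random size between 16 and 16384 bytes, and the elapsed wall time gives operations per second; the allocator under test is whatever malloc the binary is linked against (or LD_PRELOADed with).

/* Hypothetical benchmark sketch -- not the vtmalloc test program. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NBLOCKS  16000          /* total blocks, divided among the threads  */
#define NTHREADS 4
#define NROUNDS  1000           /* alloc/free rounds per block (assumed)    */

static void *run_thread(void *arg)
{
    unsigned int seed = (unsigned int) (long) arg + 1;
    int perthread = NBLOCKS / NTHREADS;
    int i, r;

    for (r = 0; r < NROUNDS; ++r) {
        for (i = 0; i < perthread; ++i) {
            size_t size = 16 + (size_t) (rand_r(&seed) % (16384 - 16 + 1));
            void *ptr = malloc(size);     /* allocator under test */
            free(ptr);
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    struct timespec t0, t1;
    double secs, ops;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < NTHREADS; ++i) {
        pthread_create(&tids[i], NULL, run_thread, (void *) (long) i);
    }
    for (i = 0; i < NTHREADS; ++i) {
        pthread_join(tids[i], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    ops  = 2.0 * NTHREADS * (NBLOCKS / NTHREADS) * NROUNDS;  /* one alloc + one free each */
    printf("%.0f ops/sec under %d threads\n", ops / secs, NTHREADS);
    return 0;
}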
Re: [naviserver-devel] Quest for malloc
On Linux, the Tcl version of the test just crashes constantly in free; I have no other OSes here.

(gdb) bt
#0  0xb7f0d410 in ?? ()
#1  0xb6cd1b78 in ?? ()
#2  0x0006 in ?? ()
#3  0x3746 in ?? ()
#4  0xb7d3f731 in raise () from /lib/libc.so.6
#5  0xb7d40f08 in abort () from /lib/libc.so.6
#6  0xb7d74e7b in __libc_message () from /lib/libc.so.6
#7  0xb7d7ab10 in malloc_printerr () from /lib/libc.so.6
#8  0xb7d7c1a9 in free () from /lib/libc.so.6
#9  0x080485d9 in MemThread (arg=0x0) at ttest.c:33
#10 0xb7e8943f in NewThreadProc (clientData=0x804a358) at /home/vlad/src/ossweb/external/archlinux/tcl/src/tcl8.4.14/unix/../generic/tclEvent.c:1229
#11 0xb7d014a2 in start_thread () from /lib/libpthread.so.0
#12 0xb7dd5ede in clone () from /lib/libc.so.6

Zoran Vasiljevic wrote: On 19.12.2006, at 20:42, Stephen Deasey wrote: On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: Right, with Ns_ functions it does not crash. Zoran will be happy... :-) Not at all! So, I would like to know exactly how to reproduce the problem (what OS, machine, etc). Furthermore, I need all your test code and possibly the gdb trace of the crash, to start with. Can you get all that for me?

/* Header names were lost in the list archive; these are the ones the code needs. */
#include <tcl.h>
#include <stdio.h>
#include <stdlib.h>

#define MemAlloc malloc
#define MemFree free

static int nbuffer = 16384;
static int nloops = 5;
static int nthreads = 4;
static void *gPtr = NULL;
static Tcl_Mutex gLock;

void MemThread(void *arg)
{
    int i, n;
    void *ptr = NULL;

    for (i = 0; i < nloops; ++i) {
        n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
        if (ptr != NULL) {
            MemFree(ptr);
        }
        ptr = MemAlloc(n);
        if (n % 50 == 0) {
            Tcl_MutexLock(&gLock);
            if (gPtr != NULL) {
                MemFree(gPtr);
                gPtr = NULL;
            } else {
                gPtr = MemAlloc(n);
            }
            Tcl_MutexUnlock(&gLock);
        }
    }
}

int main(int argc, char **argv)
{
    int i;
    Tcl_ThreadId *tids;

    tids = (Tcl_ThreadId *) malloc(sizeof(Tcl_ThreadId) * nthreads);
    for (i = 0; i < nthreads; ++i) {
        Tcl_CreateThread(&tids[i], MemThread, NULL,
                         TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
    }
    for (i = 0; i < nthreads; ++i) {
        Tcl_JoinThread(tids[i], NULL);
    }
    return 0;
}
Re: [naviserver-devel] Quest for malloc
On 19.12.2006, at 20:42, Stephen Deasey wrote: On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: Right, with Ns_ functions it does not crash. Zoran will be happy... :-) Not at all! So, I would like to know exactly how to reproduce the problem (what OS, machine, etc). Furthermore, I need all your test code and possibly the gdb trace of the crash, to start with. Can you get all that for me?
Re: [naviserver-devel] Quest for malloc
On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: Right, with Ns_ functions it does not crash. Zoran will be happy... :-)
Re: [naviserver-devel] Quest for malloc
Right, with Ns_ functions it does not crash. Stephen Deasey wrote: On 12/19/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: On 19.12.2006, at 17:08, Vlad Seryakov wrote: I converted all to use pthreads directly instead of Tcl wrappers, and now it does not crash anymore. Will continue testing but it looks like Tcl is the problem here, not ptmalloc Where does it crash? I see you are just using Tcl_CreateThread Tcl_MutexLock/Unlock Tcl_JoinThread Those just fallback to underlying pthread lib. It makes no real sense. I believe. Simply loading the Tcl library initialises a bunch of thread stuff, right? Also, the Tcl mutexes are self initialising, which includes calling down into the global Tcl mutex. Lots of stuff going on behind the scenes... NaviServer mutexes are also self initialising, but they call down to the pthread_ functions without touching any Tcl code, which may explain why the server isn't crashing all the time. So here's a test: what happens when you compile the test program to use Ns_Mutex and Ns_ThreadCreate etc.? Pthreads work, Tcl doesn't, how about NaviServer? - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ naviserver-devel mailing list naviserver-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/naviserver-devel -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/ /* * gcc -I/usr/local/ns/include -g ttest.c -o ttest -lpthread /usr/local/ns/lib/libnsthread.so * */ #include #include #include #include #include #include #define MemAlloc malloc #define MemFree free static int nbuffer = 16384; static int nloops = 15; static int nthreads = 12; static void *gPtr = NULL; static Ns_Mutex gLock; void MemThread(void *arg) { int i,n; void *ptr = NULL; for (i = 0; i < nloops; ++i) { n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0))); if (ptr != NULL) { MemFree(ptr); } ptr = MemAlloc(n); if (n % 50 == 0) { Ns_MutexLock(&gLock); if (gPtr != NULL) { MemFree(gPtr); gPtr = NULL; } else { gPtr = MemAlloc(n); } Ns_MutexUnlock(&gLock); } } } int main (int argc, char **argv) { int i; Ns_Thread *tids; if (argc > 1) { nthreads = atoi(argv[1]); } if (argc > 2) { nloops = atoi(argv[2]); } if (argc > 3) { nbuffer = atoi(argv[3]); } tids = (Ns_Thread *)malloc(sizeof(Tcl_ThreadId) * nthreads); for (i = 0; i < nthreads; ++i) { Ns_ThreadCreate(MemThread, 0, 0, &tids[i]); } for (i = 0; i < nthreads; ++i) { Ns_ThreadJoin(&tids[i], NULL); } }
Re: [naviserver-devel] Quest for malloc
On 12/19/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: On 19.12.2006, at 17:08, Vlad Seryakov wrote: > I converted all to use pthreads directly instead of Tcl wrappers, and > now it does not crash anymore. Will continue testing but it looks like > Tcl is the problem here, not ptmalloc Where does it crash? I see you are just using Tcl_CreateThread Tcl_MutexLock/Unlock Tcl_JoinThread Those just fallback to underlying pthread lib. It makes no real sense. I believe. Simply loading the Tcl library initialises a bunch of thread stuff, right? Also, the Tcl mutexes are self initialising, which includes calling down into the global Tcl mutex. Lots of stuff going on behind the scenes... NaviServer mutexes are also self initialising, but they call down to the pthread_ functions without touching any Tcl code, which may explain why the server isn't crashing all the time. So here's a test: what happens when you compile the test program to use Ns_Mutex and Ns_ThreadCreate etc.? Pthreads work, Tcl doesn't, how about NaviServer?
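To illustrate what "self initialising" means here, below is a hypothetical sketch of the pattern; it is not the actual Tcl or NaviServer source, and the names are invented. The first Lock() on a zero-initialised handle creates the real pthread mutex under a single global bootstrap lock, which is exactly the kind of hidden work Stephen is pointing at. The difference he describes is only in what that first Lock() drags in: Tcl mutexes bootstrap through the global Tcl mutex and other Tcl internals, while NaviServer mutexes bootstrap through plain pthread_ calls without touching any Tcl code.

/* Hypothetical sketch of a self-initialising mutex wrapper. */
#include <pthread.h>
#include <stdlib.h>

typedef struct MyMutex_ *MyMutex;   /* opaque handle; NULL means "not yet created" */

struct MyMutex_ {
    pthread_mutex_t lock;
};

static pthread_mutex_t bootstrapLock = PTHREAD_MUTEX_INITIALIZER;

void MyMutexLock(MyMutex *mutexPtr)
{
    if (*mutexPtr == NULL) {
        /* First use: create the real mutex under the global bootstrap lock so
         * that two threads racing on the same handle don't both create one.
         * (A real implementation also needs memory barriers here.) */
        pthread_mutex_lock(&bootstrapLock);
        if (*mutexPtr == NULL) {
            MyMutex m = malloc(sizeof(*m));
            pthread_mutex_init(&m->lock, NULL);
            *mutexPtr = m;
        }
        pthread_mutex_unlock(&bootstrapLock);
    }
    pthread_mutex_lock(&(*mutexPtr)->lock);
}

void MyMutexUnlock(MyMutex *mutexPtr)
{
    pthread_mutex_unlock(&(*mutexPtr)->lock);
}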
Re: [naviserver-devel] Quest for malloc
I have no idea; I've spent too much time on this and still don't realize what I am doing or what to expect :-))) Zoran Vasiljevic wrote: On 19.12.2006, at 17:08, Vlad Seryakov wrote: I converted everything to use pthreads directly instead of the Tcl wrappers, and now it does not crash anymore. I will continue testing, but it looks like Tcl is the problem here, not ptmalloc. Where does it crash? I see you are just using Tcl_CreateThread, Tcl_MutexLock/Unlock, Tcl_JoinThread. Those just fall back to the underlying pthread lib. It makes no real sense, I believe.
Re: [naviserver-devel] Quest for malloc
On 19.12.2006, at 17:08, Vlad Seryakov wrote: I converted everything to use pthreads directly instead of the Tcl wrappers, and now it does not crash anymore. I will continue testing, but it looks like Tcl is the problem here, not ptmalloc. Where does it crash? I see you are just using Tcl_CreateThread, Tcl_MutexLock/Unlock, Tcl_JoinThread. Those just fall back to the underlying pthread lib. It makes no real sense, I believe.
Re: [naviserver-devel] Quest for malloc
I converted everything to use pthreads directly instead of the Tcl wrappers, and now it does not crash anymore. I will continue testing, but it looks like Tcl is the problem here, not ptmalloc. Stephen Deasey wrote: On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: yes, it crashes when the number of threads is more than 1 with any size, but not all the time; sometimes I need to run it several times. Looks like it is random, some combination, not sure of what. I guess we never got that high concurrency in NaviServer; I wonder if AOL has random crashes. You're still using Tcl threads. Strip it out. Make the loops and block size command line parameters. If you think you've found a bug you'll want the most concise test case so you can report it to the glibc maintainers. #glibc on irc.freenode.net
Re: [naviserver-devel] Quest for malloc
On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: yes, it crashes when the number of threads is more than 1 with any size, but not all the time; sometimes I need to run it several times. Looks like it is random, some combination, not sure of what. I guess we never got that high concurrency in NaviServer; I wonder if AOL has random crashes. You're still using Tcl threads. Strip it out. Make the loops and block size command line parameters. If you think you've found a bug you'll want the most concise test case so you can report it to the glibc maintainers. #glibc on irc.freenode.net
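In the spirit of that advice, a minimal Tcl-free reproducer might look like the sketch below. It is hypothetical (not Vlad's actual stripped-down program; the parameter names and defaults are assumptions): threads, loop count and block size come from the command line, the block size is fixed rather than random, and a single shared pointer is handed between threads under a pthread mutex.

/* Hypothetical minimal reproducer: pthreads only, no Tcl, no randomness. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static int nthreads = 2;
static int nloops = 100000;
static size_t blocksize = 1024;

static void *gPtr = NULL;
static pthread_mutex_t gLock = PTHREAD_MUTEX_INITIALIZER;

static void *MemThread(void *arg)
{
    void *ptr = NULL;
    int i;

    for (i = 0; i < nloops; ++i) {
        free(ptr);
        ptr = malloc(blocksize);

        /* Hand a block between threads now and then. */
        if (i % 50 == 0) {
            pthread_mutex_lock(&gLock);
            if (gPtr != NULL) {
                free(gPtr);
                gPtr = NULL;
            } else {
                gPtr = malloc(blocksize);
            }
            pthread_mutex_unlock(&gLock);
        }
    }
    free(ptr);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t *tids;
    int i;

    if (argc > 1) { nthreads  = atoi(argv[1]); }
    if (argc > 2) { nloops    = atoi(argv[2]); }
    if (argc > 3) { blocksize = (size_t) atoi(argv[3]); }

    tids = malloc(sizeof(pthread_t) * nthreads);
    for (i = 0; i < nthreads; ++i) {
        pthread_create(&tids[i], NULL, MemThread, NULL);
    }
    for (i = 0; i < nthreads; ++i) {
        pthread_join(tids[i], NULL);
    }
    return 0;
}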
Re: [naviserver-devel] Quest for malloc
On 19.12.2006, at 16:35, Vlad Seryakov wrote: yes, it crashes when the number of threads is more than 1 with any size, but not all the time; sometimes I need to run it several times. Looks like it is random, some combination, not sure of what. I guess we never got that high concurrency in NaviServer; I wonder if AOL has random crashes.

Concurrency or not, I'm running it on the fastest Mac you can buy, tweaked to 16 threads and with the loop increased from 5 to 50, and get this:

(with nedmalloc)
Blitzer:~/nedmalloc_tcl root# time ./tcltest
real    0m2.036s
user    0m4.652s
sys     0m1.823s

(with standard malloc)
Blitzer:~/nedmalloc_tcl root# time ./tcltest
real    0m9.140s
user    0m17.319s
sys     0m17.397s

So that's about 4 times faster. I cannot reproduce any crash, whatever I try.
Re: [naviserver-devel] Quest for malloc
yes, it crashes when number of threads are more than 1 with any size but not all the time, sometimes i need to run it several times, looks like it is random, some combination, not sure of what. I guess we never got that high concurrency in Naviserver, i wonder if AOL has randomm crashes. Stephen Deasey wrote: Is this really the shortest test case you can make for this problem? - Does it crash if you allocate blocks of size 1024 rather than random size? Does for me. Strip it out. - Does it crash if you run 2 threads instead of 4? Does for me. Strip it out. Some times it crashes, some times it doesn't. Clearly it's timing related. The root cause is not going to be identified by injecting a whole bunch of random! Make this program shorter. On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: I tried nedmalloc with LD_PRELOAD for my little test and it crashed vene before the start. Zoran, can you test it on Solaris and OSX so we'd know that is not Linux related problem. #include #include #include #include #include #include #define MemAlloc malloc #define MemFree free static int nbuffer = 16384; static int nloops = 5; static int nthreads = 4; static void *gPtr = NULL; static Tcl_Mutex gLock; void MemThread(void *arg) { int i,n; void *ptr = NULL; for (i = 0; i < nloops; ++i) { n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0))); if (ptr != NULL) { MemFree(ptr); } ptr = MemAlloc(n); if (n % 50 == 0) { Tcl_MutexLock(&gLock); if (gPtr != NULL) { MemFree(gPtr); gPtr = NULL; } else { gPtr = MemAlloc(n); } Tcl_MutexUnlock(&gLock); } } } int main (int argc, char **argv) { int i; Tcl_ThreadId *tids; tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads); for (i = 0; i < nthreads; ++i) { Tcl_CreateThread( &tids[i], MemThread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); } for (i = 0; i < nthreads; ++i) { Tcl_JoinThread(tids[i], NULL); } } Zoran Vasiljevic wrote: On 19.12.2006, at 01:10, Stephen Deasey wrote: This program allocates memory in a worker thread and frees it in the main thread. If all free()'s put memory into a thread-local cache then you would expect this program to bloat, but it doesn't, so I guess it's not a problem (at least not on Fedora Core 5). It is also not the case with nedmalloc as it specifically tracks that usage pattern. The block being free'd "knows" to which so-called mspace it belongs regardless which thread free's it. So, I'd say the nedmalloc is OK in this respect. I have given it a purify run and it runs cleanly. Our application is nnoticeably faster on Mac and bloats less. But this is only a tip of the iceberg. We yet have to give it a real stress-test on the field, yet I'm reluctant to do this now and will have to wait for a major release somewhere in spring next year. - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ naviserver-devel mailing list naviserver-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/naviserver-devel -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/ - Take Surveys. Earn Cash. 
Re: [naviserver-devel] Quest for malloc
Is this really the shortest test case you can make for this problem? - Does it crash if you allocate blocks of size 1024 rather than random size? Does for me. Strip it out. - Does it crash if you run 2 threads instead of 4? Does for me. Strip it out. Some times it crashes, some times it doesn't. Clearly it's timing related. The root cause is not going to be identified by injecting a whole bunch of random! Make this program shorter. On 12/19/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: I tried nedmalloc with LD_PRELOAD for my little test and it crashed vene before the start. Zoran, can you test it on Solaris and OSX so we'd know that is not Linux related problem. #include #include #include #include #include #include #define MemAlloc malloc #define MemFree free static int nbuffer = 16384; static int nloops = 5; static int nthreads = 4; static void *gPtr = NULL; static Tcl_Mutex gLock; void MemThread(void *arg) { int i,n; void *ptr = NULL; for (i = 0; i < nloops; ++i) { n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0))); if (ptr != NULL) { MemFree(ptr); } ptr = MemAlloc(n); if (n % 50 == 0) { Tcl_MutexLock(&gLock); if (gPtr != NULL) { MemFree(gPtr); gPtr = NULL; } else { gPtr = MemAlloc(n); } Tcl_MutexUnlock(&gLock); } } } int main (int argc, char **argv) { int i; Tcl_ThreadId *tids; tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads); for (i = 0; i < nthreads; ++i) { Tcl_CreateThread( &tids[i], MemThread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); } for (i = 0; i < nthreads; ++i) { Tcl_JoinThread(tids[i], NULL); } } Zoran Vasiljevic wrote: > On 19.12.2006, at 01:10, Stephen Deasey wrote: > >> This program allocates memory in a worker thread and frees it in the >> main thread. If all free()'s put memory into a thread-local cache then >> you would expect this program to bloat, but it doesn't, so I guess >> it's not a problem (at least not on Fedora Core 5). > > It is also not the case with nedmalloc as it specifically > tracks that usage pattern. The block being free'd "knows" > to which so-called mspace it belongs regardless which thread > free's it. > > So, I'd say the nedmalloc is OK in this respect. > I have given it a purify run and it runs cleanly. > Our application is nnoticeably faster on Mac and > bloats less. But this is only a tip of the iceberg. > We yet have to give it a real stress-test on the > field, yet I'm reluctant to do this now and will > have to wait for a major release somewhere in spring > next year. > > > > > - > Take Surveys. Earn Cash. Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share your > opinions on IT & business topics through brief surveys - and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > ___ > naviserver-devel mailing list > naviserver-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/naviserver-devel > -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/ - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ naviserver-devel mailing list naviserver-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/naviserver-devel
Re: [naviserver-devel] Quest for malloc
I was suspecting Linux malloc; it looks like it has problems with high concurrency. I tried to replace MemAlloc/MemFree with mmap/munmap, and it crashes as well.

#define MemAlloc mmalloc
#define MemFree(ptr) mfree(ptr, gSize)

void *mmalloc(size_t size)
{
    return mmap(NULL, size, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
}

void mfree(void *ptr, size_t size)
{
    munmap(ptr, size);
}

Zoran Vasiljevic wrote: On 19.12.2006, at 16:15, Vlad Seryakov wrote: gdb may slow down concurrency, does it run without gdb, also does it run with solaris malloc? No problems. Runs with malloc and nedmalloc with or w/o gdb. The same on Mac.
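Vlad's wrappers rely on a global gSize to remember the mapping length. A self-contained variant would have to record that length itself; here is a hypothetical sketch (not Vlad's code) that stashes the mapping size in a small header in front of the returned block, so munmap gets the right size back. Note that a page-granular mmap per allocation is mainly useful for ruling the allocator in or out, not as a real allocation strategy.

/* Hypothetical mmap-backed MemAlloc/MemFree that remember the mapping size. */
#include <stddef.h>
#include <sys/mman.h>

void *MemAlloc(size_t size)
{
    size_t total = size + sizeof(size_t);
    void *map = mmap(NULL, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (map == MAP_FAILED) {
        return NULL;
    }
    *(size_t *) map = total;               /* stash the mapping length          */
    return (char *) map + sizeof(size_t);  /* hand out the rest (note: only     */
                                           /* sizeof(size_t)-aligned past page) */
}

void MemFree(void *ptr)
{
    if (ptr != NULL) {
        void *map = (char *) ptr - sizeof(size_t);
        munmap(map, *(size_t *) map);
    }
}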
Re: [naviserver-devel] Quest for malloc
On 19.12.2006, at 16:15, Vlad Seryakov wrote: gdb may slow down concurrency, does it run without gdb, also does it run with solaris malloc? No problems. Runs with malloc and nedmalloc with or w/o gdb. The same on Mac.
Re: [naviserver-devel] Quest for malloc
gdb may slow down concurrency, does it run without gdb, also does it run with solaris malloc? Zoran Vasiljevic wrote: On 19.12.2006, at 16:06, Vlad Seryakov wrote: Yes, please ( I appended the code to the nedmalloc test program and renamed their main to main1) bash-2.03$ gcc -O3 -o tcltest tcltest.c -lpthread -DNDEBUG - DTCL_THREADS -I/usr/local/include -L/usr/local/lib -ltcl8.4g bash-2.03$ gdb ./tcltest GNU gdb 6.0 Copyright 2003 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "sparc-sun-solaris2.8"... (gdb) run Starting program: /space/homes/zv/nedmalloc_tcl/tcltest [New LWP 1] [New LWP 2] [New LWP 3] [New LWP 4] [New LWP 5] [New LWP 6] [New LWP 7] [New LWP 8] [LWP 7 exited] [New LWP 7] [LWP 4 exited] [New LWP 4] [LWP 8 exited] [New LWP 8] Program exited normally. (gdb) quit - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ naviserver-devel mailing list naviserver-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/naviserver-devel -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/
Re: [naviserver-devel] Quest for malloc
On 19.12.2006, at 16:06, Vlad Seryakov wrote: Yes, please (I appended the code to the nedmalloc test program and renamed their main to main1)

bash-2.03$ gcc -O3 -o tcltest tcltest.c -lpthread -DNDEBUG -DTCL_THREADS -I/usr/local/include -L/usr/local/lib -ltcl8.4g
bash-2.03$ gdb ./tcltest
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.8"...
(gdb) run
Starting program: /space/homes/zv/nedmalloc_tcl/tcltest
[New LWP 1]
[New LWP 2]
[New LWP 3]
[New LWP 4]
[New LWP 5]
[New LWP 6]
[New LWP 7]
[New LWP 8]
[LWP 7 exited]
[New LWP 7]
[LWP 4 exited]
[New LWP 4]
[LWP 8 exited]
[New LWP 8]
Program exited normally.
(gdb) quit
Re: [naviserver-devel] Quest for malloc
Yes, please Zoran Vasiljevic wrote: On 19.12.2006, at 15:57, Vlad Seryakov wrote: Zoran, can you test it on Solaris and OSX so we'd know that is not Linux related problem. I have a Tcl library compiled with nedmalloc and when I link against it and make #define MemAlloc Tcl_Alloc #define MemFree Tcl_Free it runs fine. Shold I make the Solaris test? - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ naviserver-devel mailing list naviserver-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/naviserver-devel -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/
Re: [naviserver-devel] Quest for malloc
On 19.12.2006, at 15:57, Vlad Seryakov wrote: Zoran, can you test it on Solaris and OSX so we'd know whether it is a Linux-related problem? I have a Tcl library compiled with nedmalloc, and when I link against it and make #define MemAlloc Tcl_Alloc #define MemFree Tcl_Free it runs fine. Should I run the Solaris test?
Re: [naviserver-devel] Quest for malloc
I tried nedmalloc with LD_PRELOAD for my little test and it crashed even before the start. Zoran, can you test it on Solaris and OSX so we'd know whether it is a Linux-related problem?

/* Header names were lost in the list archive; these are the ones the code needs. */
#include <tcl.h>
#include <stdio.h>
#include <stdlib.h>

#define MemAlloc malloc
#define MemFree free

static int nbuffer = 16384;
static int nloops = 5;
static int nthreads = 4;
static void *gPtr = NULL;
static Tcl_Mutex gLock;

void MemThread(void *arg)
{
    int i, n;
    void *ptr = NULL;

    for (i = 0; i < nloops; ++i) {
        n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
        if (ptr != NULL) {
            MemFree(ptr);
        }
        ptr = MemAlloc(n);
        if (n % 50 == 0) {
            Tcl_MutexLock(&gLock);
            if (gPtr != NULL) {
                MemFree(gPtr);
                gPtr = NULL;
            } else {
                gPtr = MemAlloc(n);
            }
            Tcl_MutexUnlock(&gLock);
        }
    }
}

int main(int argc, char **argv)
{
    int i;
    Tcl_ThreadId *tids;

    tids = (Tcl_ThreadId *) malloc(sizeof(Tcl_ThreadId) * nthreads);
    for (i = 0; i < nthreads; ++i) {
        Tcl_CreateThread(&tids[i], MemThread, NULL,
                         TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
    }
    for (i = 0; i < nthreads; ++i) {
        Tcl_JoinThread(tids[i], NULL);
    }
    return 0;
}

Zoran Vasiljevic wrote: On 19.12.2006, at 01:10, Stephen Deasey wrote: This program allocates memory in a worker thread and frees it in the main thread. If all free()'s put memory into a thread-local cache then you would expect this program to bloat, but it doesn't, so I guess it's not a problem (at least not on Fedora Core 5). It is also not the case with nedmalloc as it specifically tracks that usage pattern. The block being free'd "knows" to which so-called mspace it belongs, regardless of which thread frees it. So, I'd say nedmalloc is OK in this respect. I have given it a purify run and it runs cleanly. Our application is noticeably faster on Mac and bloats less. But this is only the tip of the iceberg. We have yet to give it a real stress test in the field, but I'm reluctant to do this now and will have to wait for a major release somewhere in spring next year.
Re: [naviserver-devel] Quest for malloc
On 19.12.2006, at 01:10, Stephen Deasey wrote: This program allocates memory in a worker thread and frees it in the main thread. If all free()'s put memory into a thread-local cache then you would expect this program to bloat, but it doesn't, so I guess it's not a problem (at least not on Fedora Core 5). It is also not the case with nedmalloc as it specifically tracks that usage pattern. The block being free'd "knows" to which so-called mspace it belongs, regardless of which thread frees it. So, I'd say nedmalloc is OK in this respect. I have given it a purify run and it runs cleanly. Our application is noticeably faster on Mac and bloats less. But this is only the tip of the iceberg. We have yet to give it a real stress test in the field, but I'm reluctant to do this now and will have to wait for a major release somewhere in spring next year.
Re: [naviserver-devel] Quest for malloc
On 12/18/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: On 18.12.2006, at 19:57, Stephen Deasey wrote: > > > One thing I wonder about this is, how do requests average out across > all threads? If you set the conn threads to exit after 10,000 > requests, will they all quit at roughly the same time causing an > extreme load on the server? Also, this is only an option for conn > threads. With scheduled proc threads, job threads etc. you get > nothing. > Well, if they all start to exit at the same time, they will serialize at the point where per-thread cache is pushed to the shared pool. I was worried more about things like all the Tcl procs needing to be recompiled in the new interp for the thread, and all the other stuff which is cached. If threads exit regularly, say after 10,000 requests, and the requests average out over all threads, then your site will regularly go down, effectively. It would be nice if we could make sure the thread exits were spread out. Anyway... > I think some people are experiencing fragmentation problems with > ptmalloc -- the Squid and OpenLDAP guys, for example. There's also > the malloc-in-one-thread, free-in-another problem, which if your > threads don't exit is basically a leak. Really a leak? Why? Wouln't that depend on the implementation? Yes, and I thought that was the case with Linux ptmalloc, but maybe I got it wrong or this is old news... This program allocates memory in a worker thread and frees it in the main thread. If all free()'s put memory into a thread-local cache then you would expect this program to bloat, but it doesn't, so I guess it's not a problem (at least not on Fedora Core 5). #include #include #include #include #define MemAlloc malloc #define MemFree free void *gPtr = NULL; static void Thread(void *arg); static void PrintMemUsage(const char *msg); int main (int argc, char **argv) { Tcl_ThreadId tid; int i; PrintMemUsage("start"); for (i = 0; i < 10; ++i) { Tcl_CreateThread(&tid, Thread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); Tcl_JoinThread(tid, NULL); MemFree(gPtr); gPtr = NULL; } PrintMemUsage("stop"); } static void Thread(void *arg) { assert(gPtr == NULL); gPtr = MemAlloc(1024); assert(gPtr != NULL); } static void PrintMemUsage(const char *msg) { FILE *f; int m; f = fopen("/proc/self/statm", "r"); if (f == NULL) { perror("fopen failed: "); exit(-1); } if (fscanf(f, "%d", &m) != 1) { perror("fscanf failed: "); exit(-1); } fclose(f); printf("%s: %d\n", msg, m); }
Re: [naviserver-devel] Quest for malloc
I suspect something i am doing wrong, but still it crashes and i do not see it why #include #include #include #include #include #include #define MemAlloc malloc #define MemFree free static int nbuffer = 16384; static int nloops = 5; static int nthreads = 4; static void *gPtr = NULL; static Tcl_Mutex gLock; void MemThread(void *arg) { int i,n; void *ptr = NULL; for (i = 0; i < nloops; ++i) { n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0))); if (ptr != NULL) { MemFree(ptr); } ptr = MemAlloc(n); if (n % 50 == 0) { Tcl_MutexLock(&gLock); if (gPtr != NULL) { MemFree(gPtr); gPtr = NULL; } else { gPtr = MemAlloc(n); } Tcl_MutexUnlock(&gLock); } } } int main (int argc, char **argv) { int i; Tcl_ThreadId *tids; tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads); for (i = 0; i < nthreads; ++i) { Tcl_CreateThread( &tids[i], MemThread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); } for (i = 0; i < nthreads; ++i) { Tcl_JoinThread(tids[i], NULL); } } Stephen Deasey wrote: On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: Still, even without the last free and with mutex around it, it core dumps in free(gPtr) during the loop. OK. Still doesn't mean your program is bug free :-) There's a lot of extra stuff going on in your example program that makes it hard to see what's going on. I simplified it to this: #include #include #include #define MemAlloc ckalloc #define MemFree ckfree void *gPtr = NULL; /* Global pointer to memory. */ void Thread(void *arg) { assert(gPtr != NULL); MemFree(gPtr); gPtr = NULL; } int main (int argc, char **argv) { Tcl_ThreadId tid; int i; for (i = 0; i < 10; ++i) { gPtr = MemAlloc(1024); assert(gPtr != NULL); Tcl_CreateThread(&tid, Thread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); Tcl_JoinThread(tid, NULL); assert(gPtr == NULL); } } Works for me. I say you can allocate memory in one thread and free it in another. Let me know what the bug turns out to be..! Stephen Deasey wrote: On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: I tried to run this program, it crahses with all allocators on free when it was allocated in other thread. zippy does it as well, i amnot sure how Naviserver works then. I don't think allocate in one thread, free in another is an unusual strategy. Googling around I see a lot of people doing it. There must be some bugs in your program. Here's one: At the end of MemThread() gPtr is checked and freed, but the gMutex is not held. This thread may have finished it's tight loop, but the other 3 threads could still be running. Also, the gPtr is not set to NULL after the free(), leading to a double free when the next thread checks it. 
#include #define MemAlloc ckalloc #define MemFree ckfree int nbuffer = 16384; int nloops = 5; int nthreads = 4; int gAllocs = 0; void *gPtr = NULL; Tcl_Mutex gLock; void MemThread(void *arg) { int i,n; void *ptr = NULL; for (i = 0; i < nloops; ++i) { n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0))); if (ptr != NULL) { MemFree(ptr); } ptr = MemAlloc(n); // Testing inter-thread alloc/free if (n % 5 == 0) { Tcl_MutexLock(&gLock); if (gPtr != NULL) { MemFree(gPtr); } gPtr = MemAlloc(n); gAllocs++; Tcl_MutexUnlock(&gLock); } } if (ptr != NULL) { MemFree(ptr); } if (gPtr != NULL) { MemFree(gPtr); } } void MemTime() { int i; Tcl_ThreadId *tids; tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads); for (i = 0; i < nthreads; ++i) { Tcl_CreateThread( &tids[i], MemThread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); } for (i = 0; i < nthreads; ++i) { Tcl_JoinThread(tids[i], NULL); } } int main (int argc, char **argv) { MemTime(); } - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ naviserver-devel mailing list naviserver-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/naviserver-devel -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/
Re: [naviserver-devel] Quest for malloc
On 18.12.2006, at 19:57, Stephen Deasey wrote: Are you saying you tested your app on Linux with native malloc and experienced no fragmentation/bloating? No. I have seen bloating, but less than with zippy. I saw some bloating and fragmentation with all optimizing allocators I have tested. I think some people are experiencing fragmentation problems with ptmalloc -- the Squid and OpenLDAP guys, for example. There's also the malloc-in-one-thread, free-in-another problem, which if your threads don't exit is basically a leak. Really a leak? Why? Wouldn't that depend on the implementation? Doesn't zippy also clear its per-thread cache on exit? No. It shovels all the rest to the shared pool. The shared pool is never freed. Hence lots of bloating. Actually, did you experiment with exiting the conn threads after X requests? Seems to be one of the things AOL is recommending. Most of our threads are Tcl threads, not conn threads. We create them to do lots of different tasks. They are all rather short-lived. Still, the mem footprint grows and grows... One thing I wonder about this is, how do requests average out across all threads? If you set the conn threads to exit after 10,000 requests, will they all quit at roughly the same time, causing an extreme load on the server? Also, this is only an option for conn threads. With scheduled proc threads, job threads etc. you get nothing. Well, if they all start to exit at the same time, they will serialize at the point where the per-thread cache is pushed to the shared pool.
Re: [naviserver-devel] Quest for malloc
On 18.12.2006, at 22:08, Stephen Deasey wrote: Works for me. I say you can allocate memory in one thread and free it in another. Nice. Well, I can say that nedmalloc works, that is, that small program runs to the end without coring when compiled with nedmalloc. Does this prove anything?
Re: [naviserver-devel] Quest for malloc
On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: Still, even without the last free and with mutex around it, it core dumps in free(gPtr) during the loop. OK. Still doesn't mean your program is bug free :-) There's a lot of extra stuff going on in your example program that makes it hard to see what's going on. I simplified it to this: #include #include #include #define MemAlloc ckalloc #define MemFree ckfree void *gPtr = NULL; /* Global pointer to memory. */ void Thread(void *arg) { assert(gPtr != NULL); MemFree(gPtr); gPtr = NULL; } int main (int argc, char **argv) { Tcl_ThreadId tid; int i; for (i = 0; i < 10; ++i) { gPtr = MemAlloc(1024); assert(gPtr != NULL); Tcl_CreateThread(&tid, Thread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); Tcl_JoinThread(tid, NULL); assert(gPtr == NULL); } } Works for me. I say you can allocate memory in one thread and free it in another. Let me know what the bug turns out to be..! Stephen Deasey wrote: > On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: >> I tried to run this program, it crahses with all allocators on free when >> it was allocated in other thread. zippy does it as well, i amnot sure >> how Naviserver works then. > > > I don't think allocate in one thread, free in another is an unusual > strategy. Googling around I see a lot of people doing it. There must > be some bugs in your program. Here's one: > > At the end of MemThread() gPtr is checked and freed, but the gMutex is > not held. This thread may have finished it's tight loop, but the other > 3 threads could still be running. Also, the gPtr is not set to NULL > after the free(), leading to a double free when the next thread checks > it. > > >> #include >> >> #define MemAlloc ckalloc >> #define MemFree ckfree >> >> int nbuffer = 16384; >> int nloops = 5; >> int nthreads = 4; >> >> int gAllocs = 0; >> void *gPtr = NULL; >> Tcl_Mutex gLock; >> >> void MemThread(void *arg) >> { >> int i,n; >> void *ptr = NULL; >> >> for (i = 0; i < nloops; ++i) { >> n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0))); >> if (ptr != NULL) { >> MemFree(ptr); >> } >> ptr = MemAlloc(n); >> // Testing inter-thread alloc/free >> if (n % 5 == 0) { >> Tcl_MutexLock(&gLock); >> if (gPtr != NULL) { >> MemFree(gPtr); >> } >> gPtr = MemAlloc(n); >> gAllocs++; >> Tcl_MutexUnlock(&gLock); >> } >> } >> if (ptr != NULL) { >> MemFree(ptr); >> } >> if (gPtr != NULL) { >> MemFree(gPtr); >> } >> } >> >> void MemTime() >> { >> int i; >> Tcl_ThreadId *tids; >> tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads); >> >> for (i = 0; i < nthreads; ++i) { >> Tcl_CreateThread( &tids[i], MemThread, NULL, >> TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); >> } >> for (i = 0; i < nthreads; ++i) { >> Tcl_JoinThread(tids[i], NULL); >> } >> } >> >> int main (int argc, char **argv) >> { >> MemTime(); >> } > > - > Take Surveys. Earn Cash. Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share your > opinions on IT & business topics through brief surveys - and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > ___ > naviserver-devel mailing list > naviserver-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/naviserver-devel > -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/ - Take Surveys. Earn Cash. 
Re: [naviserver-devel] Quest for malloc
Still, even without the last free and with mutex around it, it core dumps in free(gPtr) during the loop. Stephen Deasey wrote: On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: I tried to run this program, it crahses with all allocators on free when it was allocated in other thread. zippy does it as well, i amnot sure how Naviserver works then. I don't think allocate in one thread, free in another is an unusual strategy. Googling around I see a lot of people doing it. There must be some bugs in your program. Here's one: At the end of MemThread() gPtr is checked and freed, but the gMutex is not held. This thread may have finished it's tight loop, but the other 3 threads could still be running. Also, the gPtr is not set to NULL after the free(), leading to a double free when the next thread checks it. #include #define MemAlloc ckalloc #define MemFree ckfree int nbuffer = 16384; int nloops = 5; int nthreads = 4; int gAllocs = 0; void *gPtr = NULL; Tcl_Mutex gLock; void MemThread(void *arg) { int i,n; void *ptr = NULL; for (i = 0; i < nloops; ++i) { n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0))); if (ptr != NULL) { MemFree(ptr); } ptr = MemAlloc(n); // Testing inter-thread alloc/free if (n % 5 == 0) { Tcl_MutexLock(&gLock); if (gPtr != NULL) { MemFree(gPtr); } gPtr = MemAlloc(n); gAllocs++; Tcl_MutexUnlock(&gLock); } } if (ptr != NULL) { MemFree(ptr); } if (gPtr != NULL) { MemFree(gPtr); } } void MemTime() { int i; Tcl_ThreadId *tids; tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads); for (i = 0; i < nthreads; ++i) { Tcl_CreateThread( &tids[i], MemThread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); } for (i = 0; i < nthreads; ++i) { Tcl_JoinThread(tids[i], NULL); } } int main (int argc, char **argv) { MemTime(); } - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ naviserver-devel mailing list naviserver-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/naviserver-devel -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/
Re: [naviserver-devel] Quest for malloc
On 12/18/06, Vlad Seryakov <[EMAIL PROTECTED]> wrote: I tried to run this program, it crahses with all allocators on free when it was allocated in other thread. zippy does it as well, i amnot sure how Naviserver works then. I don't think allocate in one thread, free in another is an unusual strategy. Googling around I see a lot of people doing it. There must be some bugs in your program. Here's one: At the end of MemThread() gPtr is checked and freed, but the gMutex is not held. This thread may have finished it's tight loop, but the other 3 threads could still be running. Also, the gPtr is not set to NULL after the free(), leading to a double free when the next thread checks it. #include #define MemAlloc ckalloc #define MemFree ckfree int nbuffer = 16384; int nloops = 5; int nthreads = 4; int gAllocs = 0; void *gPtr = NULL; Tcl_Mutex gLock; void MemThread(void *arg) { int i,n; void *ptr = NULL; for (i = 0; i < nloops; ++i) { n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0))); if (ptr != NULL) { MemFree(ptr); } ptr = MemAlloc(n); // Testing inter-thread alloc/free if (n % 5 == 0) { Tcl_MutexLock(&gLock); if (gPtr != NULL) { MemFree(gPtr); } gPtr = MemAlloc(n); gAllocs++; Tcl_MutexUnlock(&gLock); } } if (ptr != NULL) { MemFree(ptr); } if (gPtr != NULL) { MemFree(gPtr); } } void MemTime() { int i; Tcl_ThreadId *tids; tids = (Tcl_ThreadId *)malloc(sizeof(Tcl_ThreadId) * nthreads); for (i = 0; i < nthreads; ++i) { Tcl_CreateThread( &tids[i], MemThread, NULL, TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE); } for (i = 0; i < nthreads; ++i) { Tcl_JoinThread(tids[i], NULL); } } int main (int argc, char **argv) { MemTime(); }
Re: [naviserver-devel] Quest for malloc
I tried to run this program; it crashes with all allocators in free when the memory was allocated in another thread. Zippy does it as well; I am not sure how NaviServer works then.

/* Header names were lost in the list archive; tcl.h (for ckalloc/ckfree and the
 * thread calls) and stdlib.h (for rand and malloc) are what the code needs. */
#include <tcl.h>
#include <stdlib.h>

#define MemAlloc ckalloc
#define MemFree ckfree

int nbuffer = 16384;
int nloops = 5;
int nthreads = 4;

int gAllocs = 0;
void *gPtr = NULL;
Tcl_Mutex gLock;

void MemThread(void *arg)
{
    int i, n;
    void *ptr = NULL;

    for (i = 0; i < nloops; ++i) {
        n = 1 + (int) (nbuffer * (rand() / (RAND_MAX + 1.0)));
        if (ptr != NULL) {
            MemFree(ptr);
        }
        ptr = MemAlloc(n);
        // Testing inter-thread alloc/free
        if (n % 5 == 0) {
            Tcl_MutexLock(&gLock);
            if (gPtr != NULL) {
                MemFree(gPtr);
            }
            gPtr = MemAlloc(n);
            gAllocs++;
            Tcl_MutexUnlock(&gLock);
        }
    }
    if (ptr != NULL) {
        MemFree(ptr);
    }
    if (gPtr != NULL) {
        MemFree(gPtr);
    }
}

void MemTime()
{
    int i;
    Tcl_ThreadId *tids;

    tids = (Tcl_ThreadId *) malloc(sizeof(Tcl_ThreadId) * nthreads);
    for (i = 0; i < nthreads; ++i) {
        Tcl_CreateThread(&tids[i], MemThread, NULL,
                         TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
    }
    for (i = 0; i < nthreads; ++i) {
        Tcl_JoinThread(tids[i], NULL);
    }
}

int main(int argc, char **argv)
{
    MemTime();
}

Doesn't zippy also clear its per-thread cache on exit? It puts blocks into a shared queue which other threads can re-use. But the shared cache never gets returned, so conn-thread exit will not help with memory bloat.
Re: [naviserver-devel] Quest for malloc
On 12/18/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: On 16.12.2006, at 19:31, Vlad Seryakov wrote: > But if speed is not important to you, you can supply Tcl without > zippy, > then no bloat, system is returned with reasonable speed, at least on > Linux, ptmalloc is not that bad OK. I think I've reached the peace of mind with all this alternate malloc implementations... This is what I found: On all plaforms (except the Mac OSX), it really does not pay to use anything else beside system native malloc. I mean, you can gain some percent of speed with hoard/tcmalloc/nedmalloc/zippy and friends, but you pay this with bloating memory. Are you saying you tested your app on Linux with native malloc and experienced no fragmentation/bloating? I think some people are experiencing fragmentation problems with ptmalloc -- the Squid and OpenLDAP guys, for example. There's also the malloc-in-one-thread, free-in-another problem, which if your threads don't exit is basically a leak. If it's not a problem for your app then great! Just wondering... If you can afford it, then go ahead. I believe, at least from what I've seen from my tests, that zippy is quite fast and you gain very little, if at all (speedwise) by replacing it. You can gain some less memory fragmentation by using something else, but this is not a thing that would make me say: Wow! Exception to that is really Mac OSX. The native Mac OSX malloc sucks tremendously. The speed increase by zippy and nedmalloc are so high that you can really see (without any fancy measurements), how your application flies! The nedmalloc also bloats less than zippy (normally, as it clears per-thread cache on thread exit). Doesn't zippy also clear it's per-thread cache on exit? Actually, did you experiment with exiting the conn threads after X requests? Seems to be one of the things AOL is recommending. One thing I wonder about this is, how do requests average out across all threads? If you set the conn threads to exit after 10,000 requests, will they all quit at roughly the same time causing an extreme load on the server? Also, this is only an option for conn threads. With scheduled proc threads, job threads etc. you get nothing. So for the Mac (at least for us) I will stick to nedmalloc. It is lightingly fast and reasonably conservative with memory fragmentation. Conclusion: Linux/solaris = use system malloc Mac OSX = use nedmalloc Ah, yes... windows... this I haven't tested but nedmalloc author shows some very interesting numbers on his site. I somehow tend to believe them as some I have seen by myself when experimenting on unix platforms. So, most probably the outcome will be: Windows = use nedmalloc What this means to all of us:? I would say: very little. We know that zippy is bloating and now we know that is reasonably fast and on-pair with most of the other solutions out there. For people concerned with speed, I believe this is the right solution. For people concerned with speed AND memory fragmentation (in that order) the best is to use some alternative malloc routines. For people concerned with fragmentation the best is to stay with system malloc; exception: Mac OSX. There you just need to use something else and nedmalloc is the only thing that compiles (and works) there, to my knowledge. I hope I could help somebody with this report. Cheers Zoran
Re: [naviserver-devel] Quest for malloc
On 16.12.2006, at 19:31, Vlad Seryakov wrote: But if speed is not important to you, you can supply Tcl without zippy, then no bloat, memory is returned to the system with reasonable speed; at least on Linux, ptmalloc is not that bad.

OK. I think I've reached peace of mind with all these alternate malloc implementations... This is what I found: On all platforms (except Mac OSX), it really does not pay to use anything else besides the system native malloc. I mean, you can gain a few percent of speed with hoard/tcmalloc/nedmalloc/zippy and friends, but you pay for this with memory bloat. If you can afford it, then go ahead. I believe, at least from what I've seen in my tests, that zippy is quite fast and you gain very little, if anything (speed-wise), by replacing it. You can gain somewhat less memory fragmentation by using something else, but this is not a thing that would make me say: Wow!

The exception to that is really Mac OSX. The native Mac OSX malloc sucks tremendously. The speed increases from zippy and nedmalloc are so high that you can really see (without any fancy measurements) how your application flies! The nedmalloc also bloats less than zippy (normally, as it clears its per-thread cache on thread exit). So for the Mac (at least for us) I will stick with nedmalloc. It is lightning fast and reasonably conservative with memory fragmentation.

Conclusion:
Linux/Solaris = use system malloc
Mac OSX = use nedmalloc

Ah, yes... Windows... this I haven't tested, but the nedmalloc author shows some very interesting numbers on his site. I somehow tend to believe them, as they resemble some I have seen myself when experimenting on Unix platforms. So, most probably the outcome will be:
Windows = use nedmalloc

What does this mean for all of us? I would say: very little. We know that zippy is bloating, and now we know that it is reasonably fast and on par with most of the other solutions out there. For people concerned with speed, I believe this is the right solution. For people concerned with speed AND memory fragmentation (in that order), the best is to use some alternative malloc routines. For people concerned with fragmentation, the best is to stay with the system malloc; exception: Mac OSX. There you just need to use something else, and nedmalloc is the only thing that compiles (and works) there, to my knowledge. I hope I could help somebody with this report. Cheers Zoran
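If one wanted to follow that conclusion mechanically, a build-time switch is enough. The sketch below is hypothetical: the nedmalloc()/nedfree() entry points and the "nedmalloc.h" header name are assumed from the nedmalloc distribution, and the MemAlloc/MemFree wrapper names simply follow the test programs in this thread.

/* Hypothetical per-platform allocator selection, following the conclusion above. */
#if defined(__APPLE__) || defined(_WIN32)
#include "nedmalloc.h"            /* assumed header name from the nedmalloc distribution */
#define MemAlloc(n)  nedmalloc(n)
#define MemFree(p)   nedfree(p)
#else
#include <stdlib.h>               /* Linux/Solaris: stay with the system malloc */
#define MemAlloc(n)  malloc(n)
#define MemFree(p)   free(p)
#endif

#include <stdio.h>

int main(void)
{
    void *p = MemAlloc(100);

    MemFree(p);
    printf("allocated and freed 100 bytes with the selected allocator\n");
    return 0;
}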
Re: [naviserver-devel] Quest for malloc
On 16.12.2006, at 19:31, Vlad Seryakov wrote: Linux, ptmalloc is not that bad Interestingly, ptmalloc3 (http://www.malloc.de/) and nedmalloc both derive from the dlmalloc (http://gee.cs.oswego.edu/malloc.h) library from Doug Lea. Consequently, their performance is similar (nedmalloc being slightly faster). I have been able to verify this on the Linux box.
Re: [naviserver-devel] Quest for malloc
On 16.12.2006, at 19:31, Vlad Seryakov wrote: But if speed is not important to you, you can supply Tcl without zippy, then no bloat, memory is returned to the system with reasonable speed; at least on Linux, ptmalloc is not that bad. Eh... Vlad... On the Mac, nedmalloc outperforms the standard allocator by about 25 - 30 times! The same with zippy. All tested with the supplied test program. I have yet to test the real app... On other platforms (Linux, Solaris) yes, I can stay with the standard allocator. As a matter of fact, they are close to nedmalloc, +/- about 10-30% (in favour of nedmalloc, except on Sun/sparc). One shoe does not fit all, unfortunately... What I absolutely do not understand is: WHY? I mean, why do I get a 30-times difference!? It just makes no sense, but it is really true. I am absolutely confused :-((
Re: [naviserver-devel] Quest for malloc
But if speed is not important to you, you can supply Tcl without zippy, then no bloat, system is returned with reasonable speed, at least on Linux, ptmalloc is not that bad Zoran Vasiljevic wrote: On 16.12.2006, at 16:25, Stephen Deasey wrote: Something to think about: does the nedmalloc test include allocating memory in one thread and freeing it in another? Apparently this is tough for some allocators, such as Linux ptmalloc. Naviserver does this. I'm still not 100% ready reading the code but: The Tcl allocator just puts the free'd memory in the cache of the current thread that calls free(). On thread exit, or of the size of the cache exceeds some limit, the content of the cache is appended to shared cache. The memory is never returned to the system, unless it is allocated as a chunk larger that 16K. The nedmalloc does the same but does not move freed memory between the per-thread cache and the shared repository. Instead, the thread cache is emptied (freed) when a thread exits. This must be explicitly called by the user. As I see: all is green. But will pay more attention to that by reading the code more carefully... Perhaps there is some gotcha there which I would not like to discover at the customer site ;-) In nedmalloc you can disable the per-thread cache usage by defining -DTHREADCACHEMAX=0 during compilation. This makes some difference: Testing nedmalloc with 5 threads ... This allocator achieves 16194016.581962ops/sec under 5 threads w/o cache versus Testing nedmalloc with 5 threads ... This allocator achieves 18895753.973492ops/sec under 5 threads with the cache. The THREADCACHEMAX defines the size of the allocation which goes into cache, similarily to the zippy. The default is 8K (vs. 16K with zippy). The above figures were done with max 8K size. If you increase it to 16K the malloc cores :-( Too bad. Still, I believe that for long running processes, the approach of never releasing memory to the OS, as zippy is doing, is suboptimal. Speed here or there, I'd rather save myself process reboots if possible... Bad thing is that Tcl allocator (aka zippy) will not allow me any choice but bloat. And this is becomming more and more important. At some customers site I have observed process sizes of 1.5GB whereas we started with about 80MB. Eh! - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ naviserver-devel mailing list naviserver-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/naviserver-devel -- Vlad Seryakov 571 262-8608 office [EMAIL PROTECTED] http://www.crystalballinc.com/vlad/
Re: [naviserver-devel] Quest for malloc
On 16.12.2006, at 16:25, Stephen Deasey wrote: Something to think about: does the nedmalloc test include allocating memory in one thread and freeing it in another? Apparently this is tough for some allocators, such as Linux ptmalloc. Naviserver does this.

I'm still not 100% through reading the code, but: the Tcl allocator just puts the freed memory in the cache of the thread that calls free(). On thread exit, or if the size of the cache exceeds some limit, the content of the cache is appended to the shared cache. The memory is never returned to the system, unless it was allocated as a chunk larger than 16K. nedmalloc does the same, but it does not move freed memory between the per-thread cache and the shared repository; instead, the thread cache is emptied (freed) when a thread exits, and this must be explicitly called by the user. As I see it: all is green. But I will pay more attention to that by reading the code more carefully... Perhaps there is some gotcha there which I would not like to discover at a customer site ;-)

In nedmalloc you can disable the per-thread cache by defining -DTHREADCACHEMAX=0 during compilation. This makes some difference: Testing nedmalloc with 5 threads ... This allocator achieves 16194016.581962 ops/sec under 5 threads without the cache, versus Testing nedmalloc with 5 threads ... This allocator achieves 18895753.973492 ops/sec under 5 threads with the cache. THREADCACHEMAX defines the size of allocation that goes into the cache, similarly to zippy. The default is 8K (vs. 16K with zippy). The above figures were taken with a max size of 8K; if you increase it to 16K, the malloc cores :-( Too bad.

Still, I believe that for long-running processes the approach of never releasing memory to the OS, as zippy does, is suboptimal. Speed here or there, I'd rather spare myself process reboots if possible... The bad thing is that the Tcl allocator (aka zippy) will not give me any choice but bloat. And this is becoming more and more important: at some customer sites I have observed process sizes of 1.5GB, whereas we started with about 80MB. Eh!
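To make the zippy behaviour described above concrete, here is a rough sketch of the free path. This is not the actual Tcl source; the structure names, the cache limit and the flush helper are illustrative assumptions based only on the description in this thread:

#include <stdlib.h>

#define MAXALLOC   (16 * 1024)    /* blocks above this bypass the cache */
#define CACHELIMIT (32 * 1024)    /* flush threshold; illustrative value only */

typedef struct Block { struct Block *next; size_t size; } Block;
typedef struct Cache { Block *firstFree; size_t numBytes; } Cache;

static Block *sharedFirstFree = NULL;   /* shared cache; mutex-protected in reality */

static void
FlushToSharedCache(Cache *threadCache)
{
    Block *b = threadCache->firstFree;
    if (b == NULL) {
        return;
    }
    while (b->next != NULL) {           /* append the whole thread-local list ... */
        b = b->next;
    }
    b->next = sharedFirstFree;          /* ... onto the shared free list */
    sharedFirstFree = threadCache->firstFree;
    threadCache->firstFree = NULL;
    threadCache->numBytes = 0;
}

static void
SketchFree(Cache *threadCache, Block *blockPtr)
{
    if (blockPtr->size > MAXALLOC) {
        free(blockPtr);                 /* only large chunks go back to the OS */
        return;
    }
    blockPtr->next = threadCache->firstFree;    /* push onto the thread-local list */
    threadCache->firstFree = blockPtr;
    threadCache->numBytes += blockPtr->size;
    if (threadCache->numBytes > CACHELIMIT) {
        FlushToSharedCache(threadCache);        /* nothing is returned to the OS */
    }
}

On thread exit the entire per-thread cache is appended to the shared cache in the same way, which is why the process footprint described above only ever grows.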
Re: [naviserver-devel] Quest for malloc
On 16.12.2006, at 17:29, Vlad Seryakov wrote: Instead of using threadspeed or other simple malloc/free tests, I used naviserver and Tcl pages as the test for allocators. Using ab from Apache and stress-testing it with a thousand requests, I tested several allocators, and with everything the same except LD_PRELOAD the difference seems pretty clear. Hoard/TCmalloc/Ptmalloc2 are all slower than zippy, no doubt. With threadtest, though, tcmalloc was faster than zippy, but in real life it behaves differently. So, I would suggest you try hitting naviserver with nedmalloc; if it is always faster than zippy, then you've got what you want. Another thing to watch: after each test, check the size of the nsd process. I will try nedmalloc as well later today.

Indeed, the best way is to check out the real application. No test program can give you a better picture! As far as this is concerned, I do plan to make this test, but it takes some time! I spent the whole day getting nedmalloc to compile OK on all platforms that we use (Solaris sparc/x86, Mac ppc/x86, Linux/x86, Win). The next step is to snap it into the Tcl library and try the real application...
Re: [naviserver-devel] Quest for malloc
You can, it moves Tcl_Obj structs between the thread and shared pools; the same goes for other memory blocks. On thread exit all memory goes to the shared pool.

Zoran Vasiljevic wrote: On 16.12.2006, at 17:15, Stephen Deasey wrote: Yeah, pretty sure. You can only use Tcl objects within a single interp, which is restricted to a single thread, but general ns_malloc'd memory chunks can be passed around between threads. It would suck pretty hard if that wasn't the case.

Interesting... I could swear I read somewhere that you can't just alloc in one thread and free in another using the Tcl allocator. Well, regarding nedmalloc, I do not know, but I can find out...
Re: [naviserver-devel] Quest for malloc
Instead of using threadspeed or other simple malloc/free tests, I used naviserver and Tcl pages as the test for allocators. Using ab from Apache and stress-testing it with a thousand requests, I tested several allocators, and with everything the same except LD_PRELOAD the difference seems pretty clear. Hoard/TCmalloc/Ptmalloc2 are all slower than zippy, no doubt. With threadtest, though, tcmalloc was faster than zippy, but in real life it behaves differently. So, I would suggest you try hitting naviserver with nedmalloc; if it is always faster than zippy, then you've got what you want. Another thing to watch: after each test, check the size of the nsd process. I will try nedmalloc as well later today.

Stephen Deasey wrote: On 12/16/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: Are you sure? AFAIK, we just go down to Tcl_Alloc in the Tcl library. The allocator there will not allow you to do that. There were some discussions on comp.lang.tcl about it (Jeff Hobbs knows better). As they (Tcl) just "inherited" what AOLserver had at that time (I believe V4.0), whatever applies to AS applies to Tcl and, indirectly, to us.

Yeah, pretty sure. You can only use Tcl objects within a single interp, which is restricted to a single thread, but general ns_malloc'd memory chunks can be passed around between threads. It would suck pretty hard if that wasn't the case. We have a bunch of reference-counted stuff, cache values for example, which we share among threads and delete when the reference count drops to zero. You can ns_register_proc from any thread, which needs to ns_free the old value... Here's the (a?) problem: http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html
Re: [naviserver-devel] Quest for malloc
On 16.12.2006, at 17:15, Stephen Deasey wrote: Yeah, pretty sure. You can only use Tcl objects within a single interp, which is restricted to a single thread, but general ns_malloc'd memory chunks can be passed around between threads. It would suck pretty hard if that wasn't the case.

Interesting... I could swear I read somewhere that you can't just alloc in one thread and free in another using the Tcl allocator. Well, regarding nedmalloc, I do not know, but I can find out...
Re: [naviserver-devel] Quest for malloc
On 12/16/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: Are you sure? AFAIK, we just go down to Tcl_Alloc in the Tcl library. The allocator there will not allow you to do that. There were some discussions on comp.lang.tcl about it (Jeff Hobbs knows better). As they (Tcl) just "inherited" what AOLserver had at that time (I believe V4.0), whatever applies to AS applies to Tcl and, indirectly, to us.

Yeah, pretty sure. You can only use Tcl objects within a single interp, which is restricted to a single thread, but general ns_malloc'd memory chunks can be passed around between threads. It would suck pretty hard if that wasn't the case. We have a bunch of reference-counted stuff, cache values for example, which we share among threads and delete when the reference count drops to zero. You can ns_register_proc from any thread, which needs to ns_free the old value... Here's the (a?) problem: http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html
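The allocate-in-one-thread/free-in-another pattern discussed above is easy to reproduce in a standalone stress test. A minimal sketch follows; this is not NaviServer or nedmalloc code, and the block count, sizes and handoff scheme are made up purely for illustration:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 100000

static void *blocks[NBLOCKS];                    /* handoff area, filled by the producer */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int filled = 0;

static void *producer(void *arg)                 /* allocates in this thread */
{
    for (int i = 0; i < NBLOCKS; i++) {
        blocks[i] = malloc((i % 512) + 16);
    }
    pthread_mutex_lock(&lock);
    filled = 1;
    pthread_cond_signal(&done);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)                 /* frees in a different thread */
{
    pthread_mutex_lock(&lock);
    while (!filled) {
        pthread_cond_wait(&done, &lock);
    }
    pthread_mutex_unlock(&lock);
    for (int i = 0; i < NBLOCKS; i++) {
        free(blocks[i]);                         /* cross-thread free stresses per-arena allocators */
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    puts("done");
    return 0;
}

Compiled with something like gcc -O2 -pthread and run under each candidate allocator via LD_PRELOAD, this kind of producer/consumer handoff is exactly where ptmalloc-style allocators tend to suffer.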
Re: [naviserver-devel] Quest for malloc
On 15.12.2006, at 19:59, Vlad Seryakov wrote: Will try this one.

To aid you (and others): http://www.archiware.com/downloads/nedmalloc_tcl.tar.gz Download it and peek at the README file. This compiles on all machines I tested and works pretty well in terms of speed. I haven't tested the memory footprint, nor do I have any idea about fragmentation, but the speed is pretty good. Just look at what this does on the Mac Pro (http://www.apple.com/macpro), which is currently the fastest Mac available:

Testing standard allocator with 5 threads ... This allocator achieves 531241.923013 ops/sec under 5 threads
Testing Tcl allocator with 5 threads ... This allocator achieves 439181.119284 ops/sec under 5 threads
Testing nedmalloc with 5 threads ... This allocator achieves 4137423.021490 ops/sec under 5 threads
nedmalloc allocator is 7.788209 times faster than standard
Tcl allocator is 0.826706 times faster than standard
nedmalloc is 9.420767 times faster than Tcl allocator

Hm... if I were not able to get the same or similar results on other Macs, I'd say this is a cheat. But it isn't. Zoran
Re: [naviserver-devel] Quest for malloc
On 16.12.2006, at 16:25, Stephen Deasey wrote: They seem, in the end, to go for Google tcmalloc. It wasn't the absolute fastest for their particular set of tests, but it had dramatically lower memory usage.

The downside of tcmalloc: Linux is the only port. nedmalloc does them all (Win, Solaris, Linux, Mac OS X), as it is written in ANSI C and designed to be portable. I tested all our Unix boxes and was able to get it running on all of them. And the integration is rather simple; just include the nedmalloc header and add: #define malloc nedmalloc #define realloc nedrealloc #define free nedfree I believe this needs to be done in just one Tcl source file. The trickier part: you need to call neddisablethreadcache(0) at every thread exit. The lower memory usage is important, of course. Here I have no experience yet.

Something to think about: does the nedmalloc test include allocating memory in one thread and freeing it in another? Apparently this is tough for some allocators, such as Linux ptmalloc. Naviserver does this.

Are you sure? AFAIK, we just go down to Tcl_Alloc in the Tcl library. The allocator there will not allow you to do that. There were some discussions on comp.lang.tcl about it (Jeff Hobbs knows better). As they (Tcl) just "inherited" what AOLserver had at that time (I believe V4.0), whatever applies to AS applies to Tcl and, indirectly, to us.
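Spelled out, that integration would look roughly like the following. This is only a sketch: the header name nedmalloc.h and the use of Tcl_CreateThreadExitHandler for the per-thread cleanup are assumptions; the mail above only says that neddisablethreadcache(0) must be called at thread exit.

#include <tcl.h>
#include "nedmalloc.h"            /* assumed header name from the nedmalloc package */

/* Route the raw allocation calls to nedmalloc (as suggested above,
 * done in a single Tcl source file). */
#define malloc  nedmalloc
#define realloc nedrealloc
#define free    nedfree

/* nedmalloc's per-thread cache must be released explicitly when a
 * thread goes away; 0 selects the default (system) pool. */
static void
FlushNedThreadCache(ClientData clientData)
{
    (void) clientData;
    neddisablethreadcache(0);
}

/* Call once from every thread that allocates, so the cache is
 * flushed on thread exit.  Assumed wiring, not from the original mail. */
void
RegisterNedThreadCleanup(void)
{
    Tcl_CreateThreadExitHandler(FlushNedThreadCache, NULL);
}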
Re: [naviserver-devel] Quest for malloc
On 12/16/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: Hey! I think our customers will love it! I will now try to ditch the zippy and replace it with nedmalloc... Too bad that Tcl as-is does not allow easy snap-in of alternate memory allocators. I think this should be lobbied for. It would be nice to at least have a configure switch for the zippy allocator rather than having to hack up the Makefile.
Re: [naviserver-devel] Quest for malloc
On 12/16/06, Zoran Vasiljevic <[EMAIL PROTECTED]> wrote: On 15.12.2006, at 19:59, Vlad Seryakov wrote: http://www.nedprod.com/programs/portable/nedmalloc/index.html

Hm... not bad at all. This was under Solaris 2.8 on a Sun Blade 2500 (Sparc) with 1GB memory:

Testing standard allocator with 8 threads ... This allocator achieves 2098770.683107 ops/sec under 8 threads
Testing nedmalloc with 8 threads ... This allocator achieves 1974570.587561 ops/sec under 8 threads
Testing Tcl alloc with 8 threads ... This allocator achieves 1449969.176647 ops/sec under 8 threads

Now on a SuSE Linux box, a 1.8GHz Intel:

Testing standard allocator with 8 threads ... This allocator achieves 1752893.072620 ops/sec under 8 threads
Testing nedmalloc with 8 threads ... This allocator achieves 2114564.246869 ops/sec under 8 threads
Testing Tcl alloc with 8 threads ... This allocator achieves 1460851.824732 ops/sec under 8 threads

The Tcl library was compiled for threads and uses the zippy allocator. This is how I compiled the test program from the nedmalloc package: gcc -O -g -o test test.c -lpthread -DNDEBUG -DTCL_THREADS -I/usr/local/include -L/usr/local/lib -ltcl8.4g I had to make some tweaks, as they have a problem in the pthread_islocked() private call. Also, I expanded the testsuite to include Tcl_Alloc/Tcl_Free in addition. If I run this same thing on other platforms I get more or less the same results, with one notable exception:

o. nedmalloc is always faster than standard or zippy, except on Sun Sparc, where the built-in malloc is the fastest
o. the zippy (Tcl) allocator is always the slowest of the three

Now, I imagine the nedmalloc test program may not be telling the whole truth (i.e. it may be biased towards nedmalloc)... It would be interesting to see some other metrics...

Some other metrics: http://archive.netbsd.se/?ml=OpenLDAP-devel&a=2006-07&t=2172728 They seem, in the end, to go for Google tcmalloc. It wasn't the absolute fastest for their particular set of tests, but it had dramatically lower memory usage. Something to think about: does the nedmalloc test include allocating memory in one thread and freeing it in another? Apparently this is tough for some allocators, such as Linux ptmalloc. Naviserver does this.
Re: [naviserver-devel] Quest for malloc
On 16.12.2006, at 15:00, Zoran Vasiljevic wrote: On 15.12.2006, at 19:59, Vlad Seryakov wrote: http://www.nedprod.com/programs/portable/nedmalloc/index.html

Hm... not bad at all. This was on an iMac with an Intel Dual Core 1.83 GHz and 512 MB memory:

Testing standard allocator with 8 threads ... This allocator achieves 319503.459835 ops/sec under 8 threads
Testing nedmalloc with 8 threads ... This allocator achieves 1687884.294403 ops/sec under 8 threads
Testing Tcl alloc with 8 threads ... This allocator achieves 294571.750823 ops/sec under 8 threads

Hey! I think our customers will love it! I will now try to ditch the zippy and replace it with nedmalloc... Too bad that Tcl as-is does not allow an easy snap-in of alternate memory allocators. I think this should be lobbied for.

This was under Solaris 2.8 on a Sun Blade 2500 (Sparc) with 1GB memory:

Testing standard allocator with 8 threads ... This allocator achieves 2098770.683107 ops/sec under 8 threads
Testing nedmalloc with 8 threads ... This allocator achieves 1974570.587561 ops/sec under 8 threads
Testing Tcl alloc with 8 threads ... This allocator achieves 1449969.176647 ops/sec under 8 threads

Now on a SuSE Linux box, a 1.8GHz Intel:

Testing standard allocator with 8 threads ... This allocator achieves 1752893.072620 ops/sec under 8 threads
Testing nedmalloc with 8 threads ... This allocator achieves 2114564.246869 ops/sec under 8 threads
Testing Tcl alloc with 8 threads ... This allocator achieves 1460851.824732 ops/sec under 8 threads

The Tcl library was compiled for threads and uses the zippy allocator. This is how I compiled the test program from the nedmalloc package: gcc -O -g -o test test.c -lpthread -DNDEBUG -DTCL_THREADS -I/usr/local/include -L/usr/local/lib -ltcl8.4g I had to make some tweaks, as they have a problem in the pthread_islocked() private call. Also, I expanded the testsuite to include Tcl_Alloc/Tcl_Free in addition. If I run this same thing on other platforms I get more or less the same results, with one notable exception:

o. nedmalloc is always faster than standard or zippy, except on Sun Sparc, where the built-in malloc is the fastest
o. the zippy (Tcl) allocator is always the slowest of the three

Now, I imagine the nedmalloc test program may not be telling the whole truth (i.e. it may be biased towards nedmalloc)... It would be interesting to see some other metrics... Cheers Zoran
Re: [naviserver-devel] Quest for malloc
On 15.12.2006, at 19:59, Vlad Seryakov wrote: http://www.nedprod.com/programs/portable/nedmalloc/index.html

Hm... not bad at all. This was under Solaris 2.8 on a Sun Blade 2500 (Sparc) with 1GB memory:

Testing standard allocator with 8 threads ... This allocator achieves 2098770.683107 ops/sec under 8 threads
Testing nedmalloc with 8 threads ... This allocator achieves 1974570.587561 ops/sec under 8 threads
Testing Tcl alloc with 8 threads ... This allocator achieves 1449969.176647 ops/sec under 8 threads

Now on a SuSE Linux box, a 1.8GHz Intel:

Testing standard allocator with 8 threads ... This allocator achieves 1752893.072620 ops/sec under 8 threads
Testing nedmalloc with 8 threads ... This allocator achieves 2114564.246869 ops/sec under 8 threads
Testing Tcl alloc with 8 threads ... This allocator achieves 1460851.824732 ops/sec under 8 threads

The Tcl library was compiled for threads and uses the zippy allocator. This is how I compiled the test program from the nedmalloc package: gcc -O -g -o test test.c -lpthread -DNDEBUG -DTCL_THREADS -I/usr/local/include -L/usr/local/lib -ltcl8.4g I had to make some tweaks, as they have a problem in the pthread_islocked() private call. Also, I expanded the testsuite to include Tcl_Alloc/Tcl_Free in addition. If I run this same thing on other platforms I get more or less the same results, with one notable exception:

o. nedmalloc is always faster than standard or zippy, except on Sun Sparc, where the built-in malloc is the fastest
o. the zippy (Tcl) allocator is always the slowest of the three

Now, I imagine the nedmalloc test program may not be telling the whole truth (i.e. it may be biased towards nedmalloc)... It would be interesting to see some other metrics... Cheers Zoran
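For readers who want to reproduce the Tcl_Alloc/Tcl_Free numbers: the extension to the testsuite presumably boils down to a worker thread of the following shape. This is a sketch of what such a benchmark thread typically looks like, not the actual test.c from the nedmalloc package; the thread count, iteration count and size distribution are made up, and a threaded Tcl build (-DTCL_THREADS) is assumed:

#include <tcl.h>
#include <stdlib.h>

#define ITERATIONS 1000000
#define MAXSIZE    8192               /* stay below zippy's large-block threshold */

/* One benchmark thread: allocate and free blocks of varying size
 * through Tcl_Alloc/Tcl_Free and count completed operations. */
static Tcl_ThreadCreateType
TclAllocWorker(ClientData clientData)
{
    long *opsDone = (long *) clientData;
    long  ops = 0;

    for (int i = 0; i < ITERATIONS; i++) {
        unsigned int size = (unsigned int) (rand() % MAXSIZE) + 1;
        char *ptr = Tcl_Alloc(size);
        ptr[0] = 0;                   /* touch the block so it is really used */
        Tcl_Free(ptr);
        ops += 2;                     /* one alloc plus one free */
    }
    *opsDone = ops;
    TCL_THREAD_CREATE_RETURN;
}

int main(void)
{
    enum { NTHREADS = 5 };
    Tcl_ThreadId tids[NTHREADS];
    long counts[NTHREADS];
    int  result, i;

    for (i = 0; i < NTHREADS; i++) {
        Tcl_CreateThread(&tids[i], TclAllocWorker, &counts[i],
                         TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
    }
    for (i = 0; i < NTHREADS; i++) {
        Tcl_JoinThread(tids[i], &result);
    }
    return 0;
}

Timing the joined workers and dividing the total operation count by the elapsed seconds gives ops/sec figures of the kind quoted above.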
Re: [naviserver-devel] Quest for malloc
On 15.12.2006, at 19:59, Vlad Seryakov wrote: I also tried Hoard, Google tcmalloc, umem and some other rare mallocs I could find. Still, zippy beats everybody; I ran my speed test, not threadtest. Will try this one.

Important: it is not only raw speed that matters, but also memory fragmentation (i.e. the lack of it). In our app we must frequently reboot the server (every couple of days), otherwise it just bloats. And... we made sure there are no leaks (we have Purified all the libs that we use)... I now have some experience with the (zippy) fragmentation, and I will try to set up a testbed with this allocator and run it for several days to gather some experience. Cheers Zoran
Re: [naviserver-devel] Quest for malloc
I also tried Hoard, Google tcmalloc, umem and some other rare mallocs I could find. Still, zippy beats everybody; I ran my speed test, not threadtest. Will try this one.

Zoran Vasiljevic wrote: Hi! I've tried libumem as Stephen suggested, but it is slower than the regular system malloc. This (libumem) is really geared toward integration with mdb (the Solaris modular debugger) for memory debugging and analysis. But I've found: http://www.nedprod.com/programs/portable/nedmalloc/index.html and this looks more promising. I have run its (supplied) test and it seems that, at least speed-wise, the code is faster than the native OS malloc. I will now try to make it work on all platforms that we use (admittedly, it will not run correctly if you do not set -DNDEBUG to silence some assertions; this is of course not right and I have to see why). Anyway, perhaps a thing to try out... If you get any breathtaking news with the above, share it here. On my PPC PowerBook (1.5 GHz PPC, 512 MB memory) I get improvements over the built-in allocator of a factor of 3 (3 times better) with far less system overhead. I cannot say anything about the fragmentation; this has yet to be tested. Cheers Zoran