Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
On Thu, 2005-02-17 at 22:38, Tim Peters wrote:
> Then you allocate a small object, marked 's':
>
>     bbbsfff

Isn't the whole point of obmalloc that we don't want to allocate s on the heap, since it is small? I guess s could be an object that might potentially grow.

> One thing to take from that is that LFH can't be helping list-growing
> in a direct way either, if LFH (as seems likely) also needs to copy
> objects that grow in order to keep its internal memory segregated by
> size. The indirect benefit is still available, though: LFH may be
> helping simply by keeping smaller objects out of the general heap's
> hair.

So then wouldn't this mean that there would have to be some sort of small object being allocated via the system malloc that is causing the poor behaviour? As you mention, I wouldn't think it would be list objects, since resizing lists using LFH should be *worse*. That would actually be something that is worth verifying, however. It could be that the Windows LFH is extra clever?

> I'm afraid the only way you can know for sure is by obtaining detailed
> memory maps and analyzing them.

Well, it would also be useful to find out what code is calling the system malloc. This would make it easy to examine the code and see if it should be calling obmalloc or the system malloc. Any good ideas for easily obtaining this information? I imagine that some profilers must be able to produce a complete call graph?

Evan Jones

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
[Tim Peters]
> ...
> Then you allocate a small object, marked 's':
>
>     bbbsfff

[Evan Jones]
> Isn't the whole point of obmalloc

No, because it doesn't matter what follows that introduction: obmalloc has several points, including exploiting the GIL, heuristics aiming at reusing memory while it's still high in the memory hierarchy, almost never touching a piece of memory until it's actually needed, and so on.

> is that we don't want to allocate s on the heap, since it is small?

That's one of obmalloc's goals, yes. But "small" is a relative adjective, not absolute. Because we're primarily talking about LFH here, the natural meaning for "small" in _this_ thread is <= 16KB, which is much larger than "small" means to obmalloc. The memory-map example applies just as well to LFH as to obmalloc, by changing which meaning for "small" you have in mind.

> I guess s could be an object that might potentially grow.

For example, list guts in Python are never handled by obmalloc, although the small fixed-size list _header_ object is always handled by obmalloc. One thing to take from that is that LFH can't be helping list-growing in a direct way either, if LFH (as seems likely) also needs to copy objects that grow in order to keep its internal memory segregated by size. The indirect benefit is still available, though: LFH may be helping simply by keeping smaller objects out of the general heap's hair.

> So then wouldn't this mean that there would have to be some sort of
> small object being allocated via the system malloc that is causing
> the poor behaviour?

Yes. For example, a 300-character string could do it (that's not small to obmalloc, but is to LFH). Strings produced by pickling are very often that large, and especially in Zope (which uses pickles extensively under the covers -- reading and writing persistent objects in Zope all involve pickle strings).

> As you mention, I wouldn't think it would be list objects, since
> resizing lists using LFH should be *worse*.
Until they get to LFH's boundary for "small", and we have only the vaguest idea what Martin's app does here -- we know it grows lists containing 50K elements in the end, and ... well, that's all I really know about it <wink>.

A well-known trick is applicable in that case, if Martin thinks it's worth the bother: grow the list to its final size once, at the start (overestimating if you don't know for sure). Then instead of appending, keep an index to the next free slot, same as you'd do in C. Then the list guts never move, so if that doesn't yield the same kind of speedup without using LFH, list copying wasn't actually the culprit to begin with.

> That would actually be something that is worth verifying, however.

Not worth the time to me -- Windows is closed-source, and I'm too old to enjoy staring at binary disassemblies any more. Besides, list guts can't stay in LFH after the list exceeds 4K elements. If list-copying costs are significant here, they're far more likely to be due to copying lists over 4K elements than under -- copying a list takes O(len(list)) time. So the realloc() strategy used by LFH _probably_ isn't of _primary_ interest here.

> It could be that the Windows LFH is extra clever?

Sure -- that I doubt it moves Heaven and Earth to cater to reallocs is just educated guessing. I wrote my first production heap manager at Cray Research, around 1979 <wink>.

> ...
> Well, it would also be useful to find out what code is calling the
> system malloc. This would make it easy to examine the code and see if
> it should be calling obmalloc or the system malloc. Any good ideas
> for easily obtaining this information? I imagine that some profilers
> must be able to produce a complete call graph?

Windows supports extensive facilities for analyzing heap usage, even from an external process that attaches to the process you want to analyze. Ditto for profiling. But it's not easy, and I don't know of any free tools that are of real help.
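Tim's "grow once, then fill by index" trick from earlier in this message can be sketched in Python. The function names and the list size are illustrative only, not taken from Martin's application:

```python
# Sketch of the list-preallocation trick: size the list once up front,
# then fill slots by index so the list guts never need to be moved by
# append-driven reallocs.

def build_by_append(n):
    out = []
    for i in range(n):
        out.append(i * 2)      # may trigger repeated realloc-and-copy
    return out

def build_preallocated(n):
    out = [None] * n           # one allocation of the final size
    next_free = 0              # index of the next free slot, as in C
    for i in range(n):
        out[next_free] = i * 2
        next_free += 1
    return out

if __name__ == "__main__":
    # Both build the same list; only the allocation pattern differs.
    assert build_by_append(50000) == build_preallocated(50000)
```

If the two versions run at the same speed under the plain heap, list copying was not the bottleneck to begin with, which is the diagnostic point of the trick.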
If someone were motivated enough, it would probably be easiest to run Martin's app on a Linux box, and use the free Linux tools to analyze it.
Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
On Feb 18, 2005, at 17:51, Tim Peters wrote:
> grow the list to its final size once, at the start (overestimating if
> you don't know for sure). Then instead of appending, keep an index to
> the next free slot, same as you'd do in C. Then the list guts never
> move, so if that doesn't yield the same kind of speedup without using
> LFH, list copying wasn't actually the culprit to begin with.

If this *does* improve the performance of his application by 15%, that would strongly argue for an addition to the list API similar to Java's ArrayList.ensureCapacity or the STL's vector<T>::reserve. Since the list implementation already maintains separate ints for the list array size and the list occupied size, this would really just expose this implementation detail to Python. I don't like revealing the implementation in this fashion, but if it does make a significant performance difference, it could be worth it.

http://java.sun.com/j2se/1.5.0/docs/api/java/util/ArrayList.html#ensureCapacity(int)
http://www.sgi.com/tech/stl/Vector.html#4

Evan Jones
Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
[Tim Peters]
> grow the list to its final size once, at the start (overestimating if
> you don't know for sure). Then instead of appending, keep an index to
> the next free slot, same as you'd do in C. Then the list guts never
> move, so if that doesn't yield the same kind of speedup without using
> LFH, list copying wasn't actually the culprit to begin with.

[Evan Jones]
> If this *does* improve the performance of his application by 15%,
> that would strongly argue for an addition to the list API similar to
> Java's ArrayList.ensureCapacity or the STL's vector<T>::reserve.
> Since the list implementation already maintains separate ints for the
> list array size and the list occupied size, this would really just
> expose this implementation detail to Python. I don't like revealing
> the implementation in this fashion, but if it does make a significant
> performance difference, it could be worth it.

That's a happy thought! It was first suggested for Python in 1991 <wink>, but before Python 2.4 the list implementation didn't have separate members for current size and capacity, so "can't get there from here" was the only response. It still wouldn't be trivial, because nothing in listobject.c now believes the allocated size ever needs to be preserved, and all len()-changing list operations ensure that not "too much" overallocation remains (see list_resize() in listobject.c for details). But let's see whether it would help first.
RE: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
Hi,

what immediately comes to mind are Modules/cPickle.c and Modules/cStringIO.c, which (I believe) are heavily used by ZODB (which in turn is heavily used by the application). The lists also get fairly large, although not huge - up to typically 50'000 (complex) objects in the tests I've measured.

As I said, I don't speak C, so I can only speculate - do the lists at some point grow beyond the upper limit of obmalloc, but are handled by the LFH (which has a higher upper limit, if I understood Tim Peters correctly)?

Best regards, Martin

-----Original Message-----
From: Evan Jones [mailto:[EMAIL PROTECTED]
Sent: Thursday, 17 Feb 2005 02:26
To: Python Dev
Cc: Gfeller Martin; Martin v. Löwis
Subject: Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

On Feb 16, 2005, at 18:42, Martin v. Löwis wrote:
> I must admit that I'm surprised. I would have expected that most
> allocations in Python go through obmalloc, so the heap would only see
> large allocations. It would be interesting to find out, in your
> application, why it is still an improvement to use the
> low-fragmentation heaps.

Hmm... This is an excellent point. A grep through the Python source code shows that the following files call the native system malloc (I've excluded a few obviously platform specific files). A quick visual inspection shows that most of these are using it to allocate some sort of array or string, so it likely *should* go through the system malloc. Gfeller, any idea if you are using any of the modules on this list? If so, it would be pretty easy to try converting them to call the obmalloc functions instead, and see how that affects the performance.
Evan Jones

Demo/pysvr/pysvr.c
Modules/_bsddb.c
Modules/_curses_panel.c
Modules/_cursesmodule.c
Modules/_hotshot.c
Modules/_sre.c
Modules/audioop.c
Modules/bsddbmodule.c
Modules/cPickle.c
Modules/cStringIO.c
Modules/getaddrinfo.c
Modules/main.c
Modules/pyexpat.c
Modules/readline.c
Modules/regexpr.c
Modules/rgbimgmodule.c
Modules/svmodule.c
Modules/timemodule.c
Modules/zlibmodule.c
PC/getpathp.c
Python/strdup.c
Python/thread.c
Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
[Gfeller Martin]
> what immediately comes to mind are Modules/cPickle.c and
> Modules/cStringIO.c, which (I believe) are heavily used by ZODB
> (which in turn is heavily used by the application).

I probably guessed right the first time <wink>: LFH doesn't help with the lists directly, but helps indirectly by keeping smaller objects out of the general heap where the list guts actually live.

Say we have a general heap with a memory map like this, meaning a contiguous range of available memory, where 'f' means a block is free. The units of the block don't really matter, maybe one 'f' is one byte, maybe one 'f' is 4MB -- it's all the same in the end:

    fff

Now you allocate a relatively big object (like the guts of a large list), and it's assigned a contiguous range of blocks marked 'b':

    bbb

Then you allocate a small object, marked 's':

    bbbsfff

Then you want to grow the big object. Oops! It can't extend the block of b's in-place, because 's' is in the way. Instead it has to copy the whole darn thing:

    fffsbbb

But if 's' is allocated from some _other_ heap, then the big object can grow in-place, and that's much more efficient than copying the whole thing.

obmalloc has two primary effects: it manages a large number of very small (<= 256 bytes) memory chunks very efficiently, but it _also_ helps larger objects indirectly, by keeping the very small objects out of the platform C malloc's way. LFH appears to be an extension of the same basic idea, raising the "small object" limit to 16KB.

Now note that pymalloc and LFH are *bad* ideas for objects that want to grow. pymalloc and LFH segregate the memory they manage into blocks of different sizes. For example, pymalloc keeps a list of free blocks each of which is exactly 64 bytes long. Taking a 64-byte block out of that list, or putting it back in, is very efficient. But if an object that uses a 64-byte block wants to grow, pymalloc can _never_ grow it in-place, it always has to copy it.
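The memory-map story above can be made concrete with a toy simulation. This is not how any real heap is implemented; it is just a minimal first-fit model to show why an object cannot grow in place once a small object lands right after it:

```python
# Toy heap model: a heap is a list of unit cells, 'f' if free,
# otherwise tagged with the owning object's letter.

def alloc(heap, tag, size):
    """First-fit: find `size` contiguous free cells and claim them."""
    run = 0
    for i, c in enumerate(heap):
        run = run + 1 if c == 'f' else 0
        if run == size:
            start = i - size + 1
            heap[start:i + 1] = [tag] * size
            return start
    raise MemoryError("no contiguous run of %d free cells" % size)

def grow(heap, tag, start, old, new):
    """Grow in place if the cells just past the block are free,
    otherwise free the block and relocate it (a full copy)."""
    extra = new - old
    tail = heap[start + old:start + old + extra]
    if len(tail) == extra and all(c == 'f' for c in tail):
        heap[start + old:start + old + extra] = [tag] * extra
        return start                           # grew in place
    heap[start:start + old] = ['f'] * old      # free the old block...
    return alloc(heap, tag, new)               # ...and copy elsewhere

heap = ['f'] * 10
b = alloc(heap, 'b', 3)       # bbbfffffff: the big object
s = alloc(heap, 's', 1)       # bbbsffffff: 's' lands right after 'b'
b = grow(heap, 'b', b, 3, 4)  # 's' is in the way: 'b' must relocate
print(''.join(heap))          # -> fffsbbbbff
```

Allocating 's' from a separate heap (as obmalloc and LFH do for small objects) is exactly what removes the relocation in the last step.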
That's a cost that comes with segregating memory by size, and for that reason Python deliberately doesn't use pymalloc in several cases where objects are expected to grow over time. One thing to take from that is that LFH can't be helping list-growing in a direct way either, if LFH (as seems likely) also needs to copy objects that grow in order to keep its internal memory segregated by size. The indirect benefit is still available, though: LFH may be helping simply by keeping smaller objects out of the general heap's hair.

> The lists also get fairly large, although not huge - up to typically
> 50'000 (complex) objects in the tests I've measured.

That's much larger than LFH can handle. Its limit is 16KB. A Python list with 50K elements requires a contiguous chunk of 200KB on a 32-bit machine to hold the list guts.

> As I said, I don't speak C, so I can only speculate - do the lists at
> some point grow beyond the upper limit of obmalloc, but are handled
> by the LFH (which has a higher upper limit, if I understood Tim
> Peters correctly)?

A Python list object comprises two separately allocated pieces of memory. First is a list header, a small piece of memory of fixed size, independent of len(list). The list header is always obtained from obmalloc; LFH will never be involved with that, and neither will the system malloc.

The list header has a pointer to a separate piece of memory, which contains the guts of a list, a contiguous vector of len(list) pointers (to Python objects). For a list of length n, this needs 4*n bytes on a 32-bit box. obmalloc never manages that space, and for the reason given above: we expect that list guts may grow, and obmalloc is meant for fixed-size chunks of memory. So the list guts will get handled by LFH, until the list needs more than 4K entries (hitting the 16KB LFH limit). Until then, LFH probably wastes time by copying growing list guts from size class to size class. Then the list guts finally get copied to the general heap, and stay there.
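The two-piece layout (fixed header plus a reallocated pointer vector) can be observed from Python itself on a modern CPython; sys.getsizeof reports header plus guts, so the size grows in jumps each time the vector is reallocated with spare room. This is a present-day illustration, not something available in the Python 2.3/2.4 of the thread:

```python
# Watch the list guts get reallocated: sys.getsizeof(list) covers the
# fixed header plus the pointer vector, so it changes only when the
# vector is reallocated (with overallocation) -- not on every append.
import sys

lst = []
reallocs = []                     # (length, new total size) at each jump
last = sys.getsizeof(lst)
for i in range(1000):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:              # the guts were reallocated and copied
        reallocs.append((len(lst), size))
        last = size

# Each slot in the guts is one pointer: ~8 bytes on a 64-bit build,
# so 50,000 elements need a single contiguous chunk of roughly 400KB
# (the thread's 200KB figure is the 32-bit equivalent, 4 bytes/slot).
print(reallocs[:5])
```

The jumps get sparser as the list grows, because CPython overallocates proportionally to the current size, which is what makes repeated append amortized O(1) despite the copies.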
I'm afraid the only way you can know for sure is by obtaining detailed memory maps and analyzing them.
[Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
Dear all,

I'm running a large Zope application on a 1x1GHz CPU 1GB mem Windows XP Prof machine using Zope 2.7.3 and Py 2.3.4. The application typically builds large lists by appending and extending them.

We regularly observed that using a given functionality a second time using the same process was much slower (50%) than when it ran the first time after startup. This behavior greatly improved with Python 2.3 (thanks to the improved Python object allocator, I presume).

Nevertheless, I tried to convert the heap used by Python to a Windows Low Fragmentation Heap (available on XP and 2003 Server). This improved the overall run time of a typical CPU-intensive report by about 15% (overall run time is in the 5 minutes range), with the same memory consumption.

I consider 15% significant enough to let you know about it.

For information about the Low Fragmentation Heap, see
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/low_fragmentation_heap.asp

Best regards,
Martin

PS: Since I don't speak C, I used ctypes to convert all heaps in the process to LFH (I don't know how to determine which one is the C heap).

COMIT AG
Risk Management Systems
Pflanzschulstrasse 7
CH-8004 Zürich
Telefon +41 (44) 1 298 92 84
http://www.comit.ch
http://www.quantax.com - Quantax Trading and Risk System
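Martin doesn't show his ctypes code, but a conversion like the one his PS describes might look roughly as follows: enumerate every heap in the process with GetProcessHeaps and request the LFH on each via HeapSetInformation (information class 0, HeapCompatibilityInformation, with value 2). This is a guess at the approach, not his actual script, and it only does anything on Windows XP/2003 or later:

```python
# Sketch: enable the Low Fragmentation Heap on every heap in the
# current process. On non-Windows platforms this is a no-op returning 0.
import ctypes
import sys

def enable_lfh_everywhere():
    if sys.platform != 'win32':
        return 0
    kernel32 = ctypes.windll.kernel32
    kernel32.GetProcessHeaps.restype = ctypes.c_ulong
    kernel32.GetProcessHeaps.argtypes = [ctypes.c_ulong,
                                         ctypes.POINTER(ctypes.c_void_p)]
    kernel32.HeapSetInformation.argtypes = [ctypes.c_void_p, ctypes.c_int,
                                            ctypes.c_void_p, ctypes.c_size_t]
    # First call with no buffer just reports how many heaps exist.
    n = kernel32.GetProcessHeaps(0, None)
    heaps = (ctypes.c_void_p * n)()
    n = kernel32.GetProcessHeaps(n, heaps)
    lfh = ctypes.c_ulong(2)   # 2 = enable the Low Fragmentation Heap
    converted = 0
    for h in heaps[:n]:
        # Class 0 is HeapCompatibilityInformation; fails (harmlessly)
        # on heaps that can't be converted, e.g. under a debugger.
        if kernel32.HeapSetInformation(h, 0, ctypes.byref(lfh),
                                       ctypes.sizeof(lfh)):
            converted += 1
    return converted
```

As Martin notes, this converts *all* heaps because there is no easy way to tell from Python which one is the C runtime heap that obmalloc and the interpreter actually allocate from.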
Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
Gfeller Martin wrote:
> Nevertheless, I tried to convert the heap used by Python to a Windows
> Low Fragmentation Heap (available on XP and 2003 Server). This
> improved the overall run time of a typical CPU-intensive report by
> about 15% (overall run time is in the 5 minutes range), with the same
> memory consumption.

I must admit that I'm surprised. I would have expected that most allocations in Python go through obmalloc, so the heap would only see large allocations. It would be interesting to find out, in your application, why it is still an improvement to use the low-fragmentation heaps.

Regards,
Martin
Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
On Feb 16, 2005, at 18:42, Martin v. Löwis wrote:
> I must admit that I'm surprised. I would have expected that most
> allocations in Python go through obmalloc, so the heap would only see
> large allocations. It would be interesting to find out, in your
> application, why it is still an improvement to use the
> low-fragmentation heaps.

Hmm... This is an excellent point. A grep through the Python source code shows that the following files call the native system malloc (I've excluded a few obviously platform specific files). A quick visual inspection shows that most of these are using it to allocate some sort of array or string, so it likely *should* go through the system malloc. Gfeller, any idea if you are using any of the modules on this list? If so, it would be pretty easy to try converting them to call the obmalloc functions instead, and see how that affects the performance.

Evan Jones

Demo/pysvr/pysvr.c
Modules/_bsddb.c
Modules/_curses_panel.c
Modules/_cursesmodule.c
Modules/_hotshot.c
Modules/_sre.c
Modules/audioop.c
Modules/bsddbmodule.c
Modules/cPickle.c
Modules/cStringIO.c
Modules/getaddrinfo.c
Modules/main.c
Modules/pyexpat.c
Modules/readline.c
Modules/regexpr.c
Modules/rgbimgmodule.c
Modules/svmodule.c
Modules/timemodule.c
Modules/zlibmodule.c
PC/getpathp.c
Python/strdup.c
Python/thread.c
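The kind of survey Evan describes can be reproduced with a short script: walk a CPython source tree and report which C files contain a direct call to the system malloc. The regex and the choice of extensions are assumptions for illustration, not the exact grep used in the thread (which also hand-excluded platform-specific files):

```python
# Rough reconstruction of the survey: find C sources that call the
# native malloc() directly rather than going through obmalloc.
import os
import re

# Case-sensitive on purpose: PyObject_MALLOC / PyMem_MALLOC and
# friends should not count as direct system-malloc calls.
MALLOC_CALL = re.compile(r'\bmalloc\s*\(')

def files_calling_malloc(root):
    """Return sorted relative paths of .c/.h files under `root`
    whose text contains a direct malloc( call."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(('.c', '.h')):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors='replace') as f:
                if MALLOC_CALL.search(f.read()):
                    hits.append(os.path.relpath(path, root))
    return sorted(hits)
```

Running this over a checkout of the era's source tree should produce a list similar to the one above, modulo the hand-excluded platform files; a simple textual match will also flag malloc calls that only appear in comments or dead #ifdef branches, so the "quick visual inspection" step Evan mentions is still needed.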