Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

2005-02-18 Thread Evan Jones
On Thu, 2005-02-17 at 22:38, Tim Peters wrote:
 Then you allocate a small object, marked 's':
 bbbbbbbbsfffffffffff
Isn't the whole point of obmalloc is that we don't want to allocate s 
on the heap, since it is small? I guess s could be an object that 
might potentially grow.

 One thing to take from that is that LFH can't be helping list-growing
 in a direct way either, if LFH (as seems likely) also needs to copy
 objects that grow in order to keep its internal memory segregated by
 size.  The indirect benefit is still available, though:  LFH may be
 helping simply by keeping smaller objects out of the general heap's
 hair.
So then wouldn't this mean that there would have to be some sort of 
small object being allocated via the system malloc that is causing the 
poor behaviour? As you mention, I wouldn't think it would be list 
objects, since resizing lists using LFH should be *worse*. That would 
actually be something that is worth verifying, however. It could be 
that the Windows LFH is extra clever?

 I'm afraid the only way you can know for sure is by obtaining detailed
 memory maps and analyzing them.
Well, it would also be useful to find out what code is calling the 
system malloc. This would make it easy to examine the code and see if 
it should be calling obmalloc or the system malloc. Any good ideas for 
easily obtaining this information? I imagine that some profilers must 
be able to produce a complete call graph?

Evan Jones
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

2005-02-18 Thread Tim Peters
[Tim Peters]
...
 Then you allocate a small object, marked 's':

 bbbbbbbbsfffffffffff

[Evan Jones]
 Isn't the whole point of obmalloc

No, because it doesn't matter what follows that introduction:
obmalloc has several points, including exploiting the GIL, heuristics
aiming at reusing memory while it's still high in the memory
hierarchy, almost never touching a piece of memory until it's actually
needed, and so on.

 is that we don't want to allocate s on the heap, since it is small?

That's one of obmalloc's goals, yes.  But "small" is a relative
adjective, not absolute.  Because we're primarily talking about LFH
here, the natural meaning for "small" in _this_ thread is < 16KB,
which is much larger than "small" means to obmalloc.  The memory-map
example applies just as well to LFH as to obmalloc, by changing which
meaning for "small" you have in mind.

 I guess s could be an object that might potentially grow.

For example, list guts in Python are never handled by obmalloc,
although the small fixed-size list _header_ object is always handled
by obmalloc.

 One thing to take from that is that LFH can't be helping list-growing
 in a direct way either, if LFH (as seems likely) also needs to copy
 objects that grow in order to keep its internal memory segregated by
 size.  The indirect benefit is still available, though:  LFH may be
 helping simply by keeping smaller objects out of the general heap's
 hair.

 So then wouldn't this mean that there would have to be some sort of
 small object being allocated via the system malloc that is causing the
 poor behaviour?

Yes.  For example, a 300-character string could do it (that's not
"small" to obmalloc, but is to LFH).  Strings produced by pickling are
very often that large, especially in Zope (which uses pickles
extensively under the covers -- reading and writing persistent objects
in Zope both involve pickle strings).

 As you mention, I wouldn't think it would be list objects, since resizing
 lists using LFH should be *worse*.

Until they get to LFH's boundary for "small", and we have only the
vaguest idea what Martin's app does here -- we know it grows lists
containing 50K elements in the end, and ... well, that's all I really
know about it <wink>.

A well-known trick is applicable in that case, if Martin thinks it's
worth the bother:
grow the list to its final size once, at the start (overestimating if
you don't know for sure).  Then instead of appending, keep an index to
the next free slot, same as you'd do in C.  Then the list guts never
move, so if that doesn't yield the same kind of speedup without using
LFH, list copying wasn't actually the culprit to begin with.
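In present-day Python, the trick reads something like this (a sketch, not Martin's actual code; the function name and the trimming step at the end are illustrative choices):

```python
def build_preallocated(items, estimated_size):
    """Build a list using the preallocate-and-index trick.

    The list guts are allocated once, up front, so they never move
    while the slots are being filled in (only the final trim can
    trigger a resize).
    """
    result = [None] * max(estimated_size, len(items))  # one big allocation
    i = 0  # index of the next free slot, same as you'd do in C
    for item in items:
        result[i] = item
        i += 1
    del result[i:]  # drop any unused slots from an overestimate
    return result
```

If doing this yields the same ~15% speedup without LFH, list copying really was the cost; if not, the benefit must be coming from somewhere else.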

 That would actually be something that is worth verifying, however.

Not worth the time to me -- Windows is closed-source, and I'm too old
to enjoy staring at binary disassemblies any more.  Besides, list guts
can't stay in LFH after the list exceeds 4K elements.  If list-copying
costs are significant here, they're far more likely to be due to
copying lists over 4K elements than under -- copying a list takes
O(len(list)) time.  So the realloc() strategy used by LFH _probably_
isn't of _primary_ interest here.

 It could be that the Windows LFH is extra clever?

Sure -- that I doubt it moves Heaven & Earth to cater to reallocs is
just educated guessing.  I wrote my first production heap manager at
Cray Research, around 1979 <wink>.

 ...
 Well, it would also be useful to find out what code is calling the
 system malloc. This would make it easy to examine the code and see if
 it should be calling obmalloc or the system malloc. Any good ideas for
 easily obtaining this information? I imagine that some profilers must
 be able to produce a complete call graph?

Windows supports extensive facilities for analyzing heap usage, even
from an external process that attaches to the process you want to
analyze.  Ditto for profiling.  But it's not easy, and I don't know of
any free tools that are of real help.  If someone were motivated
enough, it would probably be easiest to run Martin's app on a Linux
box, and use the free Linux tools to analyze it.


Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

2005-02-18 Thread Evan Jones
On Feb 18, 2005, at 17:51, Tim Peters wrote:
 grow the list to its final size once, at the start (overestimating if
 you don't know for sure).  Then instead of appending, keep an index to
 the next free slot, same as you'd do in C.  Then the list guts never
 move, so if that doesn't yield the same kind of speedup without using
 LFH, list copying wasn't actually the culprit to begin with.
If this *does* improve the performance of his application by 15%, that  
would strongly argue for an addition to the list API similar to Java's  
ArrayList.ensureCapacity or the STL's vector<T>::reserve. Since the  
list implementation already maintains separate ints for the list array  
size and the list occupied size, this would really just expose this  
implementation detail to Python. I don't like revealing the  
implementation in this fashion, but if it does make a significant  
performance difference, it could be worth it.

http://java.sun.com/j2se/1.5.0/docs/api/java/util/ArrayList.html#ensureCapacity(int)
http://www.sgi.com/tech/stl/Vector.html#4
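For comparison, a toy Python wrapper showing the *shape* such an API might take (CPython exposes no real capacity hint, so this merely fakes one with placeholder slots and a fill index; the class and method names are invented for illustration):

```python
class ReservableList(list):
    """Toy emulation of ArrayList.ensureCapacity / vector<T>::reserve.

    This does NOT pin the underlying allocation the way a real
    reserve() would -- it only demonstrates the intended API shape.
    """

    def __init__(self):
        super().__init__()
        self._used = 0  # how many slots hold real data

    def ensure_capacity(self, n):
        """Pre-extend with placeholder slots up to n entries."""
        if n > len(self):
            self.extend([None] * (n - len(self)))

    def add(self, item):
        """Fill the next reserved slot, appending once capacity runs out."""
        if self._used < len(self):
            self[self._used] = item
        else:
            self.append(item)
        self._used += 1

    def snapshot(self):
        """The logical contents, without the unused placeholders."""
        return list(self[:self._used])
```

A real `list.reserve()` would instead talk to the allocated-size field directly, which is exactly the implementation detail being debated.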

Evan Jones


Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

2005-02-18 Thread Tim Peters
[Tim Peters]
 grow the list to its final size once, at the start (overestimating if
 you don't know for sure).  Then instead of appending, keep an index to
 the next free slot, same as you'd do in C.  Then the list guts never
 move, so if that doesn't yield the same kind of speedup without using
 LFH, list copying wasn't actually the culprit to begin with.

[Evan Jones]
 If this *does* improve the performance of his application by 15%, that
 would strongly argue for an addition to the list API similar to Java's
 ArrayList.ensureCapacity or the STL's vector<T>::reserve. Since the
 list implementation already maintains separate ints for the list array
 size and the list occupied size, this would really just expose this
 implementation detail to Python. I don't like revealing the
 implementation in this fashion, but if it does make a significant
 performance difference, it could be worth it.

That's a happy thought!  It was first suggested for Python in 1991
<wink>, but before Python 2.4 the list implementation didn't have
separate members for current size and capacity, so "can't get there
from here" was the only response.  It still wouldn't be trivial,
because nothing in listobject.c now believes the allocated size ever
needs to be preserved, and all len()-changing list operations ensure
that not too much overallocation remains (see list_resize() in
listobject.c for details).
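That overallocation is visible from pure Python (a CPython-specific observation; the exact growth pattern differs across versions):

```python
import sys

lst = []
sizes = []
for _ in range(64):
    lst.append(None)
    sizes.append(sys.getsizeof(lst))

# The reported size stays flat across many appends, then jumps:
# each flat stretch is spare capacity that list_resize() kept around,
# so most appends cost no allocation at all.
assert len(set(sizes)) < 64   # far fewer reallocations than appends
assert sizes == sorted(sizes) # size never shrinks while appending
```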

But let's see whether it would help first.


RE: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

2005-02-17 Thread Gfeller Martin
Hi,

what immediately comes to mind are Modules/cPickle.c and Modules/cStringIO.c, 
which (I believe) are heavily used by ZODB (which in turn is heavily used by 
the application). 

The lists also get fairly large, although not huge - up to typically 50,000
(complex) objects in the tests I've measured. As I said, I don't speak C, so I
can only speculate - do the lists at some point grow beyond the upper limit of
obmalloc, but are handled by the LFH (which has a higher upper limit, if I
understood Tim Peters correctly)?

Best regards,
Martin





-Original Message-
From: Evan Jones [mailto:[EMAIL PROTECTED] 
Sent: Thursday, 17 Feb 2005 02:26
To: Python Dev
Cc: Gfeller Martin; Martin v. Löwis
Subject: Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%


On Feb 16, 2005, at 18:42, Martin v. Löwis wrote:
 I must admit that I'm surprised. I would have expected
 that most allocations in Python go through obmalloc, so
 the heap would only see large allocations.

 It would be interesting to find out, in your application,
 why it is still an improvement to use the low-fragmentation
 heaps.

Hmm... This is an excellent point. A grep through the Python source 
code shows that the following files call the native system malloc (I've 
excluded a few obviously platform specific files). A quick visual 
inspection shows that most of these are using it to allocate some sort 
of array or string, so it likely *should* go through the system malloc. 
Gfeller, any idea if you are using any of the modules on this list? If 
so, it would be pretty easy to try converting them to call the obmalloc 
functions instead, and see how that affects the performance.

Evan Jones


Demo/pysvr/pysvr.c
Modules/_bsddb.c
Modules/_curses_panel.c
Modules/_cursesmodule.c
Modules/_hotshot.c
Modules/_sre.c
Modules/audioop.c
Modules/bsddbmodule.c
Modules/cPickle.c
Modules/cStringIO.c
Modules/getaddrinfo.c
Modules/main.c
Modules/pyexpat.c
Modules/readline.c
Modules/regexpr.c
Modules/rgbimgmodule.c
Modules/svmodule.c
Modules/timemodule.c
Modules/zlibmodule.c
PC/getpathp.c
Python/strdup.c
Python/thread.c



Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

2005-02-17 Thread Tim Peters
[Gfeller Martin]
 what immediately comes to mind are Modules/cPickle.c and
 Modules/cStringIO.c, which (I believe) are heavily used by ZODB (which in turn
 is heavily used by the application).

I probably guessed right the first time <wink>:  LFH doesn't help with
the lists directly, but helps indirectly by keeping smaller objects
out of the general heap where the list guts actually live.

Say we have a general heap with a memory map like this, meaning a
contiguous range of available memory, where 'f' means a block is free.
The units of the blocks don't really matter, maybe one 'f' is one
byte, maybe one 'f' is 4MB -- it's all the same in the end:

ffffffffffffffffffff

Now you allocate a relatively big object (like the guts of a large
list), and it's assigned a contiguous range of blocks marked 'b':

bbbbbbbbffffffffffff

Then you allocate a small object, marked 's':

bbbbbbbbsfffffffffff

Then you want to grow the big object.  Oops!  It can't extend the block
of b's in-place, because 's' is in the way.  Instead it has to copy
the whole darn thing:

ffffffffsbbbbbbbbbbf

But if 's' is allocated from some _other_ heap, then the big object
can grow in-place, and that's much more efficient than copying the
whole thing.
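The memory map above can be rendered as a toy in-place-growth check (purely illustrative; a real heap manager is far more involved):

```python
def grow_in_place_possible(heap, start, old, new):
    """Can the block at [start, start+old) extend to `new` units
    without moving?  Only if the blocks right behind it are free."""
    tail = heap[start + old : start + new]
    return len(tail) == new - old and all(c == 'f' for c in tail)

heap = list('f' * 20)
heap[0:8] = 'b' * 8   # big object: 8 contiguous blocks
assert grow_in_place_possible(heap, 0, 8, 10)   # nothing in the way yet

heap[8] = 's'         # small object lands right behind the big one
assert not grow_in_place_possible(heap, 0, 8, 10)  # now it must copy
```

Keeping 's' in a separate heap (obmalloc's arenas, or the LFH) is exactly what preserves the in-place growth path.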

obmalloc has two primary effects:  it manages a large number of very
small (<= 256 bytes) memory chunks very efficiently, but it _also_
helps larger objects indirectly, by keeping the very small objects out
of the platform C malloc's way.

LFH appears to be an extension of the same basic idea, raising the
small object limit to 16KB.

Now note that pymalloc and LFH are *bad* ideas for objects that want
to grow.  pymalloc and LFH segregate the memory they manage into
blocks of different sizes.  For example, pymalloc keeps a list of free
blocks each of which is exactly 64 bytes long.  Taking a 64-byte block
out of that list, or putting it back in, is very efficient.  But if an
object that uses a 64-byte block wants to grow, pymalloc can _never_
grow it in-place, it always has to copy it.  That's a cost that comes
with segregating memory by size, and for that reason Python
deliberately doesn't use pymalloc in several cases where objects are
expected to grow over time.

One thing to take from that is that LFH can't be helping list-growing
in a direct way either, if LFH (as seems likely) also needs to copy
objects that grow in order to keep its internal memory segregated by
size.  The indirect benefit is still available, though:  LFH may be
helping simply by keeping smaller objects out of the general heap's
hair.

 The lists also get fairly large, although not huge - up to typically 50,000
 (complex) objects in the tests I've measured.

That's much larger than LFH can handle.  Its limit is 16KB.  A Python
list with 50K elements requires a contiguous chunk of 200KB on a
32-bit machine to hold the list guts.
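The sizes involved are easy to check (assuming 4-byte pointers, as on the 32-bit machine Tim describes):

```python
POINTER_SIZE = 4          # bytes per list slot on a 32-bit box
LFH_LIMIT = 16 * 1024     # LFH's small-block ceiling, per this thread

# Guts of a 50K-element list: far beyond what LFH will manage.
guts = 50_000 * POINTER_SIZE
assert guts == 200_000    # ~200KB of contiguous memory

# The largest list whose guts still fit under the LFH limit --
# the "4K elements" boundary cited below.
assert LFH_LIMIT // POINTER_SIZE == 4096
```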

 As I said, I don't speak C, so I can only speculate - do the lists at some
 point grow beyond the upper limit of obmalloc, but are handled by the LFH
 (which has a higher upper limit, if I understood Tim Peters correctly)?

A Python list object comprises two separately allocated pieces of
memory.  First is a list header, a small piece of memory of fixed
size, independent of len(list).  The list header is always obtained
from obmalloc; LFH will never be involved with that, and neither will
the system malloc.  The list header has a pointer to a separate piece
of memory, which contains the guts of a list, a contiguous vector of
len(list) pointers (to Python objects).  For a list of length n, this
needs 4*n bytes on a 32-bit box.  obmalloc never manages that space,
and for the reason given above:  we expect that list guts may grow,
and obmalloc is meant for fixed-size chunks of memory.

So the list guts will get handled by LFH, until the list needs more
than 4K entries (hitting the 16KB LFH limit).  Until then, LFH
probably wastes time by copying growing list guts from size class to
size class.  Then the list guts finally get copied to the general
heap, and stay there.

I'm afraid the only way you can know for sure is by obtaining detailed
memory maps and analyzing them.


[Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

2005-02-16 Thread Gfeller Martin
Dear all,

I'm running a large Zope application on a 1x1GHz CPU, 1GB mem 
Windows XP Prof machine using Zope 2.7.3 and Py 2.3.4. 
The application typically builds large lists by appending 
and extending them. 

We regularly observed that using a given functionality a 
second time in the same process was much slower (>50%) 
than when it ran the first time after startup. 
This behavior greatly improved with Python 2.3 (thanks 
to the improved Python object allocator, I presume). 

Nevertheless, I tried to convert the heap used by Python 
to a Windows Low Fragmentation Heap (available on XP 
and 2003 Server). This improved the overall run time 
of a typical CPU-intensive report by about 15% 
(overall run time is in the 5 minutes range), with the
same memory consumption.

I consider 15% significant enough to let you know about it.

For information about the Low Fragmentation Heap, see
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/low_fragmentation_heap.asp

Best regards,
Martin 

PS: Since I don't speak C, I used ctypes to convert all 
heaps in the process to LFH (I don't know how to determine
which one is the C heap).
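Martin doesn't show his ctypes code, but the usual recipe looks roughly like this (a sketch for Windows XP/2003 only, not Martin's actual script; the two constants are the documented HeapCompatibilityInformation values from the Win32 headers):

```python
import ctypes

HeapCompatibilityInformation = 0   # HEAP_INFORMATION_CLASS selector
HEAP_LFH = 2                       # value that switches a heap to the LFH

def enable_lfh_everywhere():
    """Switch every heap in the current process to the Low
    Fragmentation Heap.  Windows XP/2003+ only."""
    kernel32 = ctypes.windll.kernel32            # fails off-Windows
    count = kernel32.GetProcessHeaps(0, None)    # first call: just count
    heaps = (ctypes.c_void_p * count)()
    kernel32.GetProcessHeaps(count, heaps)       # second call: fetch handles
    info = ctypes.c_ulong(HEAP_LFH)
    for handle in heaps:
        kernel32.HeapSetInformation(
            ctypes.c_void_p(handle),
            HeapCompatibilityInformation,
            ctypes.byref(info),
            ctypes.sizeof(info),
        )
```

Note this converts *all* heaps indiscriminately, which matches Martin's remark that he couldn't tell which one was the C heap.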





COMIT AG
Risk Management Systems
Pflanzschulstrasse 7 
CH-8004 Zürich 

Telefon +41 (44) 1 298 92 84 

http://www.comit.ch 
http://www.quantax.com - Quantax Trading and Risk System



Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

2005-02-16 Thread Martin v. Löwis
Gfeller Martin wrote:
 Nevertheless, I tried to convert the heap used by Python 
 to a Windows Low Fragmentation Heap (available on XP 
 and 2003 Server). This improved the overall run time 
 of a typical CPU-intensive report by about 15% 
 (overall run time is in the 5 minutes range), with the
 same memory consumption.
I must admit that I'm surprised. I would have expected
that most allocations in Python go through obmalloc, so
the heap would only see large allocations.
It would be interesting to find out, in your application,
why it is still an improvement to use the low-fragmentation
heaps.
Regards,
Martin


Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

2005-02-16 Thread Evan Jones
On Feb 16, 2005, at 18:42, Martin v. Löwis wrote:
 I must admit that I'm surprised. I would have expected
 that most allocations in Python go through obmalloc, so
 the heap would only see large allocations.
 It would be interesting to find out, in your application,
 why it is still an improvement to use the low-fragmentation
 heaps.
Hmm... This is an excellent point. A grep through the Python source 
code shows that the following files call the native system malloc (I've 
excluded a few obviously platform specific files). A quick visual 
inspection shows that most of these are using it to allocate some sort 
of array or string, so it likely *should* go through the system malloc. 
Gfeller, any idea if you are using any of the modules on this list? If 
so, it would be pretty easy to try converting them to call the obmalloc 
functions instead, and see how that affects the performance.

Evan Jones
Demo/pysvr/pysvr.c
Modules/_bsddb.c
Modules/_curses_panel.c
Modules/_cursesmodule.c
Modules/_hotshot.c
Modules/_sre.c
Modules/audioop.c
Modules/bsddbmodule.c
Modules/cPickle.c
Modules/cStringIO.c
Modules/getaddrinfo.c
Modules/main.c
Modules/pyexpat.c
Modules/readline.c
Modules/regexpr.c
Modules/rgbimgmodule.c
Modules/svmodule.c
Modules/timemodule.c
Modules/zlibmodule.c
PC/getpathp.c
Python/strdup.c
Python/thread.c