Re: Possible to set cpython heap size?

2007-02-23 Thread Andrew MacIntyre
Chris Mellon wrote:
 On 22 Feb 2007 11:28:52 -0800, Andy Watson [EMAIL PROTECTED] wrote:
 On Feb 22, 10:53 am, a bunch of folks wrote:

 Memory is basically free.
 This is true if you are simply scanning a file into memory.  However,
 I'm storing the contents in some in-memory data structures and doing
 some data manipulation.   This is my speculation:

 Several small objects per scanned line get allocated, and then
 unreferenced.  If the heap is relatively small, GC has to do some work
 in order to make space for subsequent scan results.  At some point, it
 realises it cannot keep up and has to extend the heap.  At this point,
 VM and physical memory are committed, since they need to be used.  And
 this keeps going on.  At some point, GC will take a good deal of time
 to compact the heap, since I am loading in so much data and creating
 a lot of smaller objects.

 If I could have a heap that is larger and does not need to be
 dynamically extended, then the Python GC could work more efficiently.

 
 I haven't even looked at Python memory management internals since 2.3,
 and not in detail then, so I'm sure someone will correct me in the
 case that I am wrong.
 
 However, I believe that this is almost exactly how CPython GC does not
 work. CPython is refcounted with a generational GC for cycle
 detection. There's a memory pool that is used for object allocation
 (more than one, I think, for different types of objects) and those can
 be extended but they are not, to my knowledge, compacted.
 
 If you're creating the same small objects for each scanned line, and
 especially if they are tuples or new-style objects with __slots__,
 then the memory use for those objects should be more or less constant.
 Your memory growth is probably related to the information you're
 saving, not to your scanned objects, and since those are long-lived
 objects I simply don't see how heap pre-allocation could be helpful
 there.

Python's internal memory management is split:
- allocations up to 256 bytes (the majority of objects) are handled by
a custom allocator, which uses 256kB arenas malloc()ed from the OS on
demand.  With 2.5 some additional work was done to allow returning
completely empty arenas to the OS; 2.3 and 2.4 don't return arenas at
all.
- all allocations over 256 bytes, including container objects that are
extended beyond 256 bytes, are made by malloc().
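As a rough illustration (the 256-byte boundary is a pymalloc
implementation detail, and the objects below are only examples of
typical CPython sizes, not guarantees):

    small = (1, 2, 3)    # well under 256 bytes: comes from a pymalloc arena
    big = 'x' * 1024     # over 256 bytes: handed straight to malloc()
    items = range(1000)  # the list object itself is small, but its
                         # item array is a separate, malloc()ed block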

I can't recall off-hand whether the free-list structures for ints (and
floats?) use the Python allocator or direct malloc(); as the free-lists
don't release any entries, I suspect not.

The maximum allocation size and arena size used by the Python allocator
are hard-coded for algorithmic and performance reasons, and cannot
practically be changed, especially at runtime.  No active compaction
takes place in arenas, even with GC.  The only time object data is
relocated between arenas is when an object is resized.

If Andy Watson is creating loads of objects that aren't being managed
by Python's allocator (by being larger than 256 bytes, or in a type 
free-list), then the platform malloc() behaviour applies.  Some platform
allocators can be tuned via environment variables and the like, in which
case review of the platform documentation is indicated.
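For instance, glibc's malloc reads a few tuning knobs from the
environment.  A sketch only - the values here are illustrative, and
'scan.py' is a made-up script name; check your platform's malloc
documentation before relying on any of this:

    MALLOC_TRIM_THRESHOLD_=16777216 MALLOC_MMAP_THRESHOLD_=262144 python scan.py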

Some platform allocators are notorious for poor behaviour in certain 
circumstances, and coalescing blocks while deallocating is one 
particularly nasty problem for code that creates and destroys lots
of small variably sized objects.

-- 
-
Andrew I MacIntyre               These thoughts are mine alone...
E-mail: [EMAIL PROTECTED] (pref) | Snail: PO Box 370
        [EMAIL PROTECTED] (alt)  |        Belconnen ACT 2616
Web:    http://www.andymac.org/  |        Australia


Re: Possible to set cpython heap size?

2007-02-23 Thread Tony Nelson
In article [EMAIL PROTECTED],
 Andy Watson [EMAIL PROTECTED] wrote:

 ...
 If I could have a heap that is larger and does not need to be
 dynamically extended, then the Python GC could work more efficiently.
 ...

GC!  If you're allocating lots of objects and holding on to them, GC 
will run frequently, but won't find anything to free.  Maybe you want to 
turn off GC, at least some of the time?  See the gc module, esp.
set_threshold().

Note that the cyclic GC is only really a sort of safety net for 
reference loops, as normally objects are freed when their last
reference is lost.
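
Something along these lines, as a sketch only - scan_files() is a
made-up stand-in for whatever builds the in-memory structures:

    import gc

    gc.disable()             # suspend cyclic GC during the bulk load
    try:
        data = scan_files()  # hypothetical: reads files, builds structures
    finally:
        gc.enable()          # reference counting still frees most garbage
                             # while the cyclic collector is off

Or, less drastically, raise the generation-0 threshold so collections
run much less often (700 is the default):

    import gc
    gc.set_threshold(100000)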

TonyN.:'[EMAIL PROTECTED]
  '  http://www.georgeanelson.com/


Possible to set cpython heap size?

2007-02-22 Thread Andy Watson
I have an application that scans and processes a bunch of text files.
The content I'm pulling out and holding in memory is at least 200MB.

I'd love to be able to tell the CPython virtual machine that I need a
heap of, say 300MB up front rather than have it grow as needed.   I've
had a scan through the archives of comp.lang.python and the python
docs but cannot find a way to do this.  Is it possible to configure
the PVM this way?

Much appreciated,
Andy
--



Re: Possible to set cpython heap size?

2007-02-22 Thread Diez B. Roggisch
Andy Watson wrote:

 I have an application that scans and processes a bunch of text files.
 The content I'm pulling out and holding in memory is at least 200MB.
 
 I'd love to be able to tell the CPython virtual machine that I need a
 heap of, say 300MB up front rather than have it grow as needed.   I've
 had a scan through the archives of comp.lang.python and the python
 docs but cannot find a way to do this.  Is it possible to configure
 the PVM this way?

Why do you want that? And no, it is not possible. And to be honest: I have
no idea why e.g. the JVM allows for this.

Diez


Re: Possible to set cpython heap size?

2007-02-22 Thread Andy Watson
  Why do you want that? And no, it is not possible. And to be honest: I have
  no idea why e.g. the JVM allows for this.

 Diez

The reason why is simply that I know roughly how much memory I'm going
to need, and cpython seems to be taking a fair amount of time
extending its heap as I read in content incrementally.

Ta,
Andy
--



Re: Possible to set cpython heap size?

2007-02-22 Thread Diez B. Roggisch
Andy Watson wrote:

   Why do you want that? And no, it is not possible. And to be honest: I have
   no idea why e.g. the JVM allows for this.
 
 The reason why is simply that I know roughly how much memory I'm going
 to need, and cpython seems to be taking a fair amount of time
 extending its heap as I read in content incrementally.

I'm not an expert in Python's malloc schemes; I know that _some_ things are
heavily optimized, but I'm not aware that it does any clever
self-management of the heap in the general case. That would be complicated
in the presence of arbitrary C extensions anyway.


However, I'm having doubts that your observation is correct. A simple

python -m timeit -n 1 -r 1 "range(50000000)"
1 loops, best of 1: 2.38 sec per loop

will create a Python process of half a gig of RAM - for a split second - and
I don't consider 2.38 seconds a fair amount of time for heap allocation.

When I used a 4 times larger argument, my machine began swapping. THEN
things became ugly - but I don't see how preallocation will help there...
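
For comparison, xrange allocates no list at all, so the gap between
these two roughly isolates the allocation cost (timings will of course
vary by machine):

python -m timeit -n 1 -r 1 "range(50000000)"
python -m timeit -n 1 -r 1 "xrange(50000000)"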

Diez


Re: Possible to set cpython heap size?

2007-02-22 Thread Irmen de Jong
Andy Watson wrote:
   Why do you want that? And no, it is not possible. And to be honest: I have
   no idea why e.g. the JVM allows for this.

 Diez
 
 The reason why is simply that I know roughly how much memory I'm going
 to need, and cpython seems to be taking a fair amount of time
                      ^^^^^
 extending its heap as I read in content incrementally.

First make sure this is really the case.
It may be that you are just using an inefficient algorithm.
In my experience, allocating extra heap memory is hardly ever
noticeable, unless your system is out of physical RAM and has
to swap.
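
For instance, building one huge string by repeated concatenation can go
quadratic, while collecting the pieces and joining once stays linear.
A sketch (the '*.txt' pattern is made up):

    import glob

    # potentially quadratic: each + may copy everything read so far
    s = ''
    for name in glob.glob('*.txt'):
        s = s + open(name).read()

    # linear: gather the pieces, copy once at the end
    parts = []
    for name in glob.glob('*.txt'):
        parts.append(open(name).read())
    s = ''.join(parts)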

--Irmen


Re: Possible to set cpython heap size?

2007-02-22 Thread Chris Mellon
On 22 Feb 2007 09:52:49 -0800, Andy Watson [EMAIL PROTECTED] wrote:
   Why do you want that? And no, it is not possible. And to be honest: I have
   no idea why e.g. the JVM allows for this.
 
  Diez

 The reason why is simply that I know roughly how much memory I'm going
 to need, and cpython seems to be taking a fair amount of time
 extending its heap as I read in content incrementally.


To my knowledge, no modern OS actually commits any memory at all to a
process until it is written to. Pre-extending the heap would either a)
do nothing, because it'd be essentially a no-op, or b) would take at
least as long as doing it incrementally (because Python would need to
fill up all that space with objects), without giving you any actual
performance gain when you fill the object space for real.

In Java, as I understand it, having a fixed size heap allows some
optimizations in the garbage collector. Python's GC model is different
and, as far as I know, is unlikely to benefit from this.


Re: Possible to set cpython heap size?

2007-02-22 Thread Jussi Salmela
Andy Watson kirjoitti:
 I have an application that scans and processes a bunch of text files.
 The content I'm pulling out and holding in memory is at least 200MB.
 
 I'd love to be able to tell the CPython virtual machine that I need a
 heap of, say 300MB up front rather than have it grow as needed.   I've
 had a scan through the archives of comp.lang.python and the python
 docs but cannot find a way to do this.  Is it possible to configure
 the PVM this way?
 
 Much appreciated,
 Andy
 --
 

Others have already suggested swap as a possible cause of slowness. I've
been playing on my laptop (dual-core Intel T2300 @ 1.66 GHz; 1 GB of
RAM; Win XP; PyScripter IDE)
using the following code:

#===
import datetime

'''
# Create 10 files with sizes 1MB, ..., 10MB
for i in range(1,11):
    print 'Writing: ' + 'Bytes_' + str(i*1000000)
    f = open('Bytes_' + str(i*1000000), 'w')
    f.write(str(i-1)*i*1000000)
    f.close()
'''

# Read the files 5 times, concatenating the contents
# into one HUGE string
now_1 = datetime.datetime.now()
s = ''
for count in range(5):
    for i in range(1,11):
        print 'Reading: ' + 'Bytes_' + str(i*1000000)
        f = open('Bytes_' + str(i*1000000), 'r')
        s = s + f.read()
        f.close()
        print 'Size of s is', len(s)
print 's[274999999] = ' + s[274999999]
now_2 = datetime.datetime.now()
print now_1
print now_2
raw_input('???')
#===

The part at the start that is commented out is the part I used to create 
the 10 files. The second part prints the following output (abbreviated):

Reading: Bytes_1000000
Size of s is 1000000
Reading: Bytes_2000000
Size of s is 3000000
Reading: Bytes_3000000
Size of s is 6000000
Reading: Bytes_4000000
Size of s is 10000000
Reading: Bytes_5000000
Size of s is 15000000
Reading: Bytes_6000000
Size of s is 21000000
Reading: Bytes_7000000
Size of s is 28000000
Reading: Bytes_8000000
Size of s is 36000000
Reading: Bytes_9000000
Size of s is 45000000
Reading: Bytes_10000000
Size of s is 55000000
snip
Reading: Bytes_9000000
Size of s is 265000000
Reading: Bytes_10000000
Size of s is 275000000
s[274999999] = 9
2007-02-22 20:23:09.984000
2007-02-22 20:23:21.515000

As can be seen, creating a 275 MB string by reading the parts from the
files took less than 12 seconds. I think this is fast enough, but others
might disagree! ;)

Using the Win Task Manager I can see the process grow to a little
less than 282 MB when it reaches the raw_input call, and drop to less
than 13 MB shortly after I've given some input, apparently as a result
of PyScripter doing a GC.

Your situation (hardware, file sizes etc.) may differ so that my
experiment does not correspond to it, but this was my 2 cents' worth!

HTH,
Jussi


Re: Possible to set cpython heap size?

2007-02-22 Thread Andy Watson
On Feb 22, 10:53 am, a bunch of folks wrote:

 Memory is basically free.

This is true if you are simply scanning a file into memory.  However,
I'm storing the contents in some in-memory data structures and doing
some data manipulation.   This is my speculation:

Several small objects per scanned line get allocated, and then
unreferenced.  If the heap is relatively small, GC has to do some work
in order to make space for subsequent scan results.  At some point, it
realises it cannot keep up and has to extend the heap.  At this point,
VM and physical memory are committed, since they need to be used.  And
this keeps going on.  At some point, GC will take a good deal of time
to compact the heap, since I am loading in so much data and creating
a lot of smaller objects.

If I could have a heap that is larger and does not need to be
dynamically extended, then the Python GC could work more efficiently.

Interesting discussion.
Cheers,
Andy
--



Re: Possible to set cpython heap size?

2007-02-22 Thread Chris Mellon
On 22 Feb 2007 11:28:52 -0800, Andy Watson [EMAIL PROTECTED] wrote:
 On Feb 22, 10:53 am, a bunch of folks wrote:

  Memory is basically free.

 This is true if you are simply scanning a file into memory.  However,
 I'm storing the contents in some in-memory data structures and doing
 some data manipulation.   This is my speculation:

 Several small objects per scanned line get allocated, and then
 unreferenced.  If the heap is relatively small, GC has to do some work
 in order to make space for subsequent scan results.  At some point, it
 realises it cannot keep up and has to extend the heap.  At this point,
 VM and physical memory are committed, since they need to be used.  And
 this keeps going on.  At some point, GC will take a good deal of time
 to compact the heap, since I am loading in so much data and creating
 a lot of smaller objects.

 If I could have a heap that is larger and does not need to be
 dynamically extended, then the Python GC could work more efficiently.


I haven't even looked at Python memory management internals since 2.3,
and not in detail then, so I'm sure someone will correct me in the
case that I am wrong.

However, I believe that this is almost exactly how CPython GC does not
work. CPython is refcounted with a generational GC for cycle
detection. There's a memory pool that is used for object allocation
(more than one, I think, for different types of objects) and those can
be extended but they are not, to my knowledge, compacted.

If you're creating the same small objects for each scanned line, and
especially if they are tuples or new-style objects with __slots__,
then the memory use for those objects should be more or less constant.
Your memory growth is probably related to the information you're
saving, not to your scanned objects, and since those are long-lived
objects I simply don't see how heap pre-allocation could be helpful
there.
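
For instance, a per-line record type with __slots__ keeps each instance
at a fixed, small size (a sketch - the field names are invented):

    class ScanRecord(object):
        # __slots__ suppresses the per-instance __dict__
        __slots__ = ('lineno', 'tag', 'value')
        def __init__(self, lineno, tag, value):
            self.lineno = lineno
            self.tag = tag
            self.value = value

    rec = ScanRecord(1, 'header', 'some payload')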


Re: Possible to set cpython heap size?

2007-02-22 Thread Martin v. Löwis
Andy Watson wrote:
 I have an application that scans and processes a bunch of text files.
 The content I'm pulling out and holding in memory is at least 200MB.
 
 I'd love to be able to tell the CPython virtual machine that I need a
 heap of, say 300MB up front rather than have it grow as needed.   I've
 had a scan through the archives of comp.lang.python and the python
 docs but cannot find a way to do this.  Is it possible to configure
 the PVM this way?

You can configure your operating system. On Unix, do 'ulimit -m 200000'.

Regards,
Martin