Re: Python object overhead?

2007-03-26 Thread Matt Garman
On 3/23/07, Jack Diederich <[EMAIL PROTECTED]> wrote:
> If you make the record a new style class (inherit from object) you can
> specify the __slots__ attribute on the class.  This eliminates the per
> instance dictionary overhead in exchange for less flexibility.

When you say "new style class", do you mean that the __slots__ feature
is only available in a newer version of Python?  Unfortunately, I'm
stuck on 2.3.4 for this project.
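
For reference, new-style classes and __slots__ both date from Python 2.2, so the quoted suggestion should apply on 2.3.4 as well. A minimal sketch follows; the class name SlottedRecord is hypothetical, and the single "line" attribute mirrors the contrived example from the original post.

```python
class SlottedRecord(object):  # inheriting from object makes this a new-style class
    # Only the names listed here can be set on an instance;
    # no per-instance __dict__ is created.
    __slots__ = ("line",)

    def __init__(self, line):
        self.line = line


rec = SlottedRecord("a|b|c\n")
has_dict = hasattr(rec, "__dict__")  # the per-instance dictionary is gone

# The trade-off: assigning an attribute that is not in __slots__ fails.
try:
    rec.extra = 1
    rejected = False
except AttributeError:
    rejected = True
```

The flexibility given up is exactly the failed assignment at the end: attributes have to be declared up front.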

Thanks,
Matt
-- 
http://mail.python.org/mailman/listinfo/python-list



Re: Python object overhead?

2007-03-26 Thread Matt Garman
On 3/23/07, Bjoern Schliessmann
<[EMAIL PROTECTED]> wrote:
> "one blank line" == "EOF"? That's strange. Intended?

In my case, I know my input data doesn't have any blank lines.
However, I'm glad you (and others) clarified the issue, because I
wasn't aware of the better methods for checking for EOF.
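
As a side note, readline() returns an empty string only at end of file (a blank line comes back as "\n"), so the length test does work. Iterating over the file object is the more common idiom, though, and needs no explicit EOF check. A minimal sketch, where the helper name read_lines is hypothetical:

```python
def read_lines(path):
    """Return all lines of the file at `path`, keeping the trailing newlines."""
    lines = []
    f = open(path)
    try:
        # Iterating over a file object yields one line at a time
        # and stops by itself at end of file.
        for line in f:
            lines.append(line)
    finally:
        f.close()
    return lines
```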

> > Example 2: read lines into objects:
> > # begin readobjects.py
> > import sys, time
> > class FileRecord:
> >     def __init__(self, line):
> >         self.line = line
>
> What's this class intended to do?

Store a line :)  I just wanted to post two runnable examples.  So the
above class's real intention is just to be a (contrived) example.

In the program I actually wrote, my class structure was a bit more
interesting.  After storing the input line, I'd then call split("|")
(to tokenize the line).  Each token would then be assigned to a
member variable.  Some of the member variables were converted to ints
or floats as well.
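
The tokenizing step described above might look roughly like the following sketch. The field layout (a record type, an integer id, a float price) and the names parse_record, rec_type, rec_id, and price are hypothetical; the actual vendor format is not shown in this thread.

```python
def parse_record(line):
    """Split one pipe-delimited line and convert the numeric fields."""
    # Strip the trailing newline before splitting so the last token is clean.
    tokens = line.rstrip("\n").split("|")
    rec_type = tokens[0]        # kept as a string
    rec_id = int(tokens[1])     # hypothetical integer field
    price = float(tokens[2])    # hypothetical float field
    return rec_type, rec_id, price


parsed = parse_record("A|42|3.5\n")
```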

My input data had three record types; all had a few common attributes.
So I created a parent class and three child classes.
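
A parent class holding the shared attributes, with one subclass per record type, could be sketched as below. All class names, field positions, and the type codes "A", "B", and "C" are assumptions made for illustration, not the actual program.

```python
class Record(object):
    """Hypothetical parent: attributes common to every record type."""
    def __init__(self, tokens):
        self.rec_type = tokens[0]
        self.rec_id = int(tokens[1])


class TypeA(Record):
    def __init__(self, tokens):
        Record.__init__(self, tokens)  # explicit parent call
        self.price = float(tokens[2])


class TypeB(Record):
    def __init__(self, tokens):
        Record.__init__(self, tokens)
        self.quantity = int(tokens[2])


class TypeC(Record):
    def __init__(self, tokens):
        Record.__init__(self, tokens)
        self.note = tokens[2]


# Map the (assumed) type code in the first field to the class that handles it.
RECORD_CLASSES = {"A": TypeA, "B": TypeB, "C": TypeC}


def make_record(line):
    """Build the right Record subclass for one pipe-delimited line."""
    tokens = line.rstrip("\n").split("|")
    return RECORD_CLASSES[tokens[0]](tokens)
```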

Also, many folks have suggested operating on only one line at a time
(i.e. not storing the whole data set).  Unfortunately, I'm constantly
"looking" forward and backward in the record set while I process the
data (i.e., to process any particular record, I sometimes need to know
the whole contents of the file).  (This is purchased proprietary
vendor data that needs to be converted into our own internal format.)

Finally, for what it's worth: the total run-time memory requirements
of my program are roughly 20x the datafile size.  A 200MB file
literally requires 4GB of RAM to process effectively.  Note that, in
addition to the class structure I defined above, I also create two
caches of all the data (two dicts with different keys from the
collection of objects).  This is necessary to ensure the program runs
in a semi-reasonable amount of time.
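
The two caches described above amount to two indexes over the same objects. In the sketch below the dicts hold references, so the records themselves are not duplicated, although each dict still carries its own hash-table overhead. The Record class and the key attributes rec_id and symbol are hypothetical.

```python
class Record(object):
    __slots__ = ("rec_id", "symbol")

    def __init__(self, rec_id, symbol):
        self.rec_id = rec_id
        self.symbol = symbol


records = [Record(1, "AAA"), Record(2, "BBB"), Record(3, "AAA")]

by_id = {}      # unique key: one record per id
by_symbol = {}  # non-unique key: list of records per symbol
for rec in records:
    by_id[rec.rec_id] = rec
    by_symbol.setdefault(rec.symbol, []).append(rec)
```

Keeping the records in a list as well preserves file order, which is what makes looking forward and backward by position possible.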

Thanks to all for your input and suggestions.  I received many more
responses than I expected!

Matt


Python object overhead?

2007-03-23 Thread Matt Garman
I'm trying to use Python to work with large pipe ('|') delimited data
files.  The files range in size from 25 MB to 200 MB.

Since each line corresponds to a record, what I'm trying to do is
create an object from each record.  However, it seems that doing this
makes memory usage two to three times higher.

See the two examples below: running each on the same input file
results in 3x the memory usage for Example 2.  (Memory usage is
checked using top.)

This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?
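
A rough way to see where the extra memory goes is to compare the size of a bare string with the size of an instance plus its attribute dictionary. The sketch below uses sys.getsizeof, which was added in Python 2.6 and so is not available on the 2.3.4 and 2.4.3 interpreters mentioned here. The exact byte counts vary by version and platform, so none are shown. The helper name instance_cost is hypothetical.

```python
import sys


class FileRecord:
    def __init__(self, line):
        self.line = line


def instance_cost(obj):
    """Approximate bytes used by an instance, excluding what its attributes refer to."""
    cost = sys.getsizeof(obj)
    if hasattr(obj, "__dict__"):
        # Each ordinary instance carries its own attribute dictionary.
        cost += sys.getsizeof(obj.__dict__)
    return cost


line = "field1|field2|field3\n"
string_bytes = sys.getsizeof(line)
wrapper_bytes = instance_cost(FileRecord(line))  # paid per record, on top of the string
```

For short lines the per-instance cost can easily exceed the size of the string itself, which would be consistent with the growth seen in top.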

Thanks,
Matt


Example 1: read lines into list:
# begin readlines.py
import sys, time
filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    filedata.append(line)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readlines.py


Example 2: read lines into objects:
# begin readobjects.py
import sys, time
class FileRecord:
    def __init__(self, line):
        self.line = line
records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0: break # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()
print "data read; sleeping 20 seconds..."
time.sleep(20) # gives time to check top
# end readobjects.py