On Fri, Mar 23, 2007 at 03:11:35PM -0600, Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files. The files range in size from 25 MB to 200 MB.
>
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record. However, it seems that doing this
> causes the memory overhead to go up two or three times.
>
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2. (Memory usage is
> checked using top.)
[snip]
When you are just appending all the lines to a big list, your overhead looks like:

    records = []
    for line in file_ob:
        records.append(line)

But when you wrap each line in a small class, the overhead per record is effectively:

    records = []
    for line in file_ob:
        records.append(line)      # the actual string
        records.append(object())  # allocation for the object instance
        records.append({})        # dictionary for per-instance attributes

For small strings like dictionary words, the overhead of the second approach is about 5x that of a plain list. Most of it is the per-instance dictionary.

If you make the record a new-style class (inherit from object), you can specify the __slots__ attribute on the class. This eliminates the per-instance dictionary overhead in exchange for less flexibility. (There's a small sketch of this in the P.S. below.)

Another solution would be to wrap the lines only as they are accessed. Make one class that holds the collection of raw records, and have it return a fancier class wrapping each record right before it is used and discarded:

    class RecordCollection(object):
        def __init__(self, raw_records):
            self.raw_records = raw_records

        def __getitem__(self, i):
            return Record(self.raw_records[i])

If you are operating on the records serially, you only pay the class overhead for one record at any given time.

Hope that helps,
-Jack
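P.S. Here is a minimal sketch of the __slots__ idea. The field names and the pipe-splitting are just assumptions I'm making about your record layout, so adjust to taste:

    class Record(object):
        # __slots__ suppresses the per-instance __dict__; only the
        # named attributes can exist, and they live in fixed slots
        # on the instance.
        __slots__ = ('name', 'value', 'flag')

        def __init__(self, line):
            # assumes three pipe-delimited fields per line
            self.name, self.value, self.flag = line.rstrip('\n').split('|')

Each record then costs the strings plus one small fixed-size object, instead of string + object + dict.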
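P.P.S. And a sketch of how the lazy-wrapping collection might be used, assuming the Record class above and a hypothetical data file:

    raw = open('data.txt').readlines()  # only plain strings are stored
    records = RecordCollection(raw)

    # Defining __getitem__ is enough for Python's legacy sequence
    # iteration protocol, so this loop works: each Record is built
    # on demand and becomes garbage as soon as the loop moves on.
    for rec in records:
        print rec.name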