On Wed, 21 Mar 2007 17:19:23 +0000, Tom Wright wrote:

>> So what's your actual problem that you are trying to solve?
>
> I have a program which reads a few thousand text files, converts each to a
> list (with readlines()), creates a short summary of the contents of each (a
> few floating point numbers) and stores this summary in a master list. From
> the amount of memory it's using, I think that the lists containing the
> contents of each file are kept in memory, even after there are no
> references to them. Also, if I tell it to discard the master list and
> re-read all the files, the memory use nearly doubles, so I presume it's
> keeping the lot in memory.
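For what it's worth, one way to keep each file's contents from ever living in memory as a list is to compute the summary while iterating over the file object, rather than calling readlines() first. A minimal sketch, assuming a made-up summary (mean line length) and a hypothetical `data/*.txt` layout standing in for whatever the real code computes:

```python
import glob

def summarise(path):
    # Hypothetical summary: mean line length, as a float.
    # Iterating over the open file reads one line at a time, so the
    # whole file never exists in memory the way a readlines() list does.
    total = count = 0
    with open(path) as f:
        for line in f:
            total += len(line)
            count += 1
    return total / count if count else 0.0

# Master list holds only a few floats per file, not the file contents.
summaries = [summarise(p) for p in glob.glob("data/*.txt")]
```

Whether this helps depends on where the memory is actually going, of course, but it removes the per-file lists from the picture entirely.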
Ah, now we're getting somewhere! Python's caching behaviour with strings is
almost certainly going to be different to its caching behaviour with ints.
(For example, Python caches short strings that look like identifiers, but I
don't believe it caches great blocks of text or short strings which include
whitespace.)

But again, you haven't really described a problem, just a set of
circumstances. Yes, the memory usage doubles. *Is* that a problem in
practice? A few thousand 1KB files is one thing; a few thousand 1MB files
is an entirely different story. Is the most cost-effective solution to the
problem to buy another 512MB of RAM? I don't say that it is. I just point
out that you haven't given us any reason to think it isn't.

> The program may run through several collections of files, but it only keeps
> a reference to the master list of the most recent collection it's looked
> at. Obviously, it's not ideal if all the old collections hang around too,
> taking up space and causing the machine to swap.

Without knowing exactly what you're doing with the data, it's hard to tell
where the memory is going. I suppose if you are storing huge lists of
millions of short strings (words?), they might all be cached. Is there a way
you can avoid storing the hypothetical word-lists in RAM, perhaps by writing
them straight out to a disk file? That *might* make a difference to the
caching algorithm used.

Or you could just have an "object leak" somewhere. Do you have any
complicated circular references that the garbage collector can't resolve?
Lists-of-lists? Trees? Anything where objects aren't being freed when you
think they are? Are you holding on to references to lists? It's more likely
that your code simply isn't freeing lists you think are being freed than it
is that Python is holding on to tens of megabytes of random text.

-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list
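P.S. If you want to test the object-leak theory directly, the `gc` module can give you a rough census of live objects. A hedged sketch (counting live lists is just one heuristic; what's worth counting depends on your own data structures):

```python
import gc

# Force a collection first, so only genuinely live objects remain.
gc.collect()

# Count list objects currently tracked by the collector. If this number
# keeps growing across iterations of your read-summarise-discard loop,
# something is still holding references to the old lists.
live_lists = [o for o in gc.get_objects() if isinstance(o, list)]
print("live lists:", len(live_lists))

# Objects the collector found unreachable but could not free end up
# here (e.g. cycles involving __del__ methods, in older Pythons).
print("uncollectable:", len(gc.garbage))
```

Run that snippet before and after discarding the master list; a count that doesn't drop is a strong hint that the leak is in your references, not in Python's caching.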