Thanks for the comments,

> (First, I had to add timing code to ReadClasses: the code you posted
> doesn't include them, and only shows timings for ReadLines.)
>
> Your program uses quite a bit of memory. I guess it gets harder and
> harder to allocate the required amounts of memory.

Well, I guess there could be something in that, but why is there a
significant increase after the first time? And after that, single-trip
time pretty much flattens out. No more obvious increases.

> If I change this line in ReadClasses:
>
>     built_classes[len(built_classes)] = HugeClass(long_line)
>
> to
>
>     dummy = HugeClass(long_line)
>
> then both times the files are read and your data structures are built,
> but after each run the data structure is freed. The result is that
> both runs are equally fast.

Isn't the 'del LINES' supposed to achieve the same thing? And really,
reading 30MB files should not be such a problem, right? (I'm also
running with 1GB of RAM.)

> I'm not sure how to speed things up here... you're doing much
> processing on a lot of small chunks of data. I have a number of
> observations and possible improvements though, and some might even
> speed things up a bit.

Cool thanks, let's go over them.

> You read the files, but don't use the contents; instead you use
> long_line over and over. I suppose you do that because this is a test,
> not your actual code?

Yeah ;-) (Do I notice a lack of trust in the responses I get? Should I
not mention 'newbie'?)

Let's get a couple of things out of the way:

- I do know about meaningful variable names and case conventions,
  but ... First of all, I also have to live with inherited code (I don't
  like people shouting in their code either), and secondly (all the
  itemx) most of these members normally _have_ descriptive names, but
  I'm not supposed to copy-paste the original code to any newsgroups.
- I also know that a plain 'return' in Python does not do anything, but
  I happen to like them. Same holds for the sys.exit() call.
- The __init__ methods normally actually do something: they initialise
  some member variables to meaningful values (by calling the clear()
  method, actually).
- The __clear__ method normally brings objects back into a well-defined
  'empty' state.
- The __del__ methods are actually needed in this case (well, in the
  _real_ code anyway). The Python code loads a module written in C++,
  and some of the member variables actually point to C++ objects created
  dynamically, so one actually has to call their destructors before
  unbinding the Python var.

I tried to get things down to as small as possible. But then I found
out that the size of the classes seems to contribute to the issue:
removing enough member variables brings you to a point where all of a
sudden the speed increases by a factor of ten, so there seems to be
some breakpoint depending on the size of the classes. So I could not
simply remove all members, but had to give them funky names. I kept the
main structure of things, though, to see if that would solicit
comments. (And it did...)

> In a number of cases, you use a dict like this:
>
>     built_classes = {}
>     for i in LINES:
>         built_classes[len(built_classes)] = ...
>
> So you're using the indices 0, 1, 2, ... as the keys. That's not what
> dictionaries are made for; lists are much better for that:
>
>     built_classes = []
>     for i in LINES:
>         built_classes.append(...)

Yeah, I inherited that part...

> Your readLines() function reads a whole file into memory. If you're
> working with large files, that's not such a good idea. It's better to
> load one line at a time into memory and work on that. I would even
> completely remove readLines() and restructure ReadClasses() like this:

Actually, part of what I removed was the real reason why readLines() is
there at all: it reads files in blocks of (at most) some_number lines,
and keeps track of the line offset in the file. I kept this structure
hoping that someone would point out something obvious, like some
internal buffer going out of scope or whatever.

All right, thanks for the tips. I guess the issue itself is still open,
though.

Cheers,
Jeroen

Jeroen Hegeman
jeroen DOT hegeman AT gmail DOT com

WARNING: This message may contain classified information.
Immediately burn this message after reading.