On Dec 12, 5:48 pm, [EMAIL PROTECTED] wrote:
> Hi, I am pretty new to Python and trying to use it for a relatively
> simple problem of loading a 5 million line text file and converting it
> into a few binary files. The text file has a fixed format (like a
> punchcard). The columns contain integer, real, and date values. The
> output files are the same values in binary. I have to parse the values
> and write the binary tuples out into the correct file based on a given
> column. It's a little more involved, but that's not important.
>
> I have a C++ prototype of the parsing code, and it loads a 5M-line file
> in about a minute. I was expecting the Python version to be 3-4 times
> slower, and I can live with that. Unfortunately, it's 20 times slower,
> and I don't see how I can fix that.
>
> The fundamental difference is that in C++, I create a single object (a
> line buffer) that's reused for each input line, and column values are
> extracted straight from that buffer without creating new string
> objects. In Python, new objects must be created and destroyed by the
> million, which must incur serious memory-management overhead.
>
> Correct me if I am wrong, but:
>
> 1) for line in file: ...
>    will create a new string object for every input line
>
> 2) line[start:end]
>    will create a new string object as well
>
> 3) int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))
>    will create 10 objects (since struct_time has 8 fields)
>
> 4) a simple test: line[i:j] + line[m:n] in hash
>    creates 3 strings, and there is no way to avoid that.
>
> I thought arrays would help, but I can't load an array without creating
> a string first: ar(line, start, end) is not supported.
>
> I hope I am missing something. I really like Python, but if there is no
> way to process data efficiently, that seems to be a problem.
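For concreteness, here is a rough sketch of the kind of loop being
described (the column offsets, field widths, and file names are
invented for illustration; only the shape of the code matters):

    import struct
    import time

    # Hypothetical fixed-column layout: an integer in columns 0-9,
    # a real in columns 10-19, and an MMDDYYHHMMSS date in columns 20-31.
    out = open("out.bin", "wb")
    for line in open("input.txt"):   # one new string per input line
        i = int(line[0:10])          # each slice is another new string
        r = float(line[10:20])
        t = int(time.mktime(time.strptime(line[20:32], "%m%d%y%H%M%S")))
        out.write(struct.pack("=idl", i, r, t))  # pack one binary tuple
    out.close()

Each pass through that loop does create a handful of short-lived
strings, which is what the post above is counting.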
20 times slower because of garbage collection sounds kinda fishy.
Posting some actual code usually helps; it's hard to tell for sure
otherwise.

George
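P.S. One quick way to settle it is to time the suspected hot spot in
isolation. A rough timeit sketch (the line length and slice bounds are
arbitrary):

    import timeit

    # Total seconds for one million 10-character slices of an 80-char line.
    t = timeit.Timer('line[5:15]', 'line = "x" * 80')
    print(t.timeit(1000000))

If slicing turns out to be cheap, the 20x gap is probably coming from
somewhere else, and only real code will show where.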