Steve,

First, many thanks!
Steve Holden wrote:
> Alexis Gallagher wrote:
>>
>> filehandle = open("data",'r',buffering=1000)
>
> This buffer size seems, shall we say, unadventurous? It's likely to slow
> things down considerably, since the filesystem is probably going to
> naturally want to use a rather larger value. I'd suggest a 64k minimum.

Good to know. I should have dug deeper into the docs. Somehow I thought the argument counted lines, not bytes.

>> for currentLine in filehandle.readlines():
>>
> Note that this is going to read the whole file in to (virtual) memory
> before entering the loop. I somehow suspect you'd rather avoid this if
> you could. I further suspect your testing has been with smaller files
> than 80GB ;-). You might want to consider
>

Oops! Thanks again. I thought that readlines() was the generator form, based on the docstring comments about the deprecation of xreadlines().

>> So on every iteration I'm processing mutable strings -- this seems
>> wrong. What's the best way to speed this up? Can I switch to some fast
>> byte-oriented immutable string library? Are there optimizing
>> compilers? Are there better ways to prep the file handle?
>>
> I'm sorry but I am not sure where the mutable strings come in. Python
> strings are immutable anyway. Well-known for it.

I misspoke. I think I was mixing this up with the issue of object-creation overhead for all of the string handling in general. Is this a bottleneck to string processing in Python, or is this a hangover from my Java days? I would have thought that dumping the standard string processing libraries in favor of byte manipulation would have been one of the biggest wins.

> Of course you leave us in the dark about the nature of
> table.markEquivalent as well.

markEquivalent() implements union-find (aka uptrees) to generate equivalence classes. Optimising that was going to be my next task.

I feel a bit silly for missing the double-processing of everything. Thanks for pointing that out. And I will check out the biopython package.
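To check that I've understood both corrections, here is a minimal sketch of how I read them: a 64k buffer, and iterating over the file object directly instead of calling readlines(). The names process_file, handle_line and demo are placeholders I made up for illustration, and the tiny temporary file just stands in for the real data.

```python
import os
import tempfile

BUFFER_SIZE = 64 * 1024  # the 64k minimum suggested above


def process_file(path, handle_line):
    """Stream `path` line by line; return how many lines were seen.

    `handle_line` is a hypothetical callback standing in for the
    real per-line work (tokenizing and so on).
    """
    count = 0
    with open(path, "r", buffering=BUFFER_SIZE) as filehandle:
        # Iterating over the file object itself yields one line at a
        # time, so the whole file is never held in memory the way
        # readlines() would hold it.
        for current_line in filehandle:
            handle_line(current_line)
            count += 1
    return count


def demo():
    """Write a tiny stand-in file and stream it back."""
    seen = []
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("a b\nc d\ne f\n")
        path = tmp.name
    try:
        total = process_file(path, seen.append)
    finally:
        os.remove(path)
    return total, seen


if __name__ == "__main__":
    print(demo())
```

Whether 64k is actually the best value for my filesystem is something I would still have to measure; the sketch only shows where the number goes.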
I'm still curious whether optimizing compilers are worth examining. For instance, I saw Pyrex and Psyco mentioned on earlier threads. I'm guessing that both the tokenizing and the uptree implementation are good candidates for one of those tools, once I shake out these algorithmic problems.

alexis
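P.S. For anyone unfamiliar with uptrees, here is a generic textbook-style sketch of the idea, with path compression and union by size. It is not my actual markEquivalent() code; the class name UpTree and the method names are invented for this illustration.

```python
class UpTree:
    """Disjoint-set forest: every item points at a parent, and the
    root of each tree names one equivalence class."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, item):
        """Return the root of item's class, registering item if unseen."""
        if item not in self.parent:
            self.parent[item] = item
            self.size[item] = 1
            return item
        # Walk up to the root.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: repoint every node on the walk at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def mark_equivalent(self, a, b):
        """Merge the classes containing a and b."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Union by size: hang the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]

    def classes(self):
        """Return the equivalence classes as a list of sets."""
        groups = {}
        for item in list(self.parent):
            groups.setdefault(self.find(item), set()).add(item)
        return list(groups.values())


if __name__ == "__main__":
    table = UpTree()
    table.mark_equivalent("a", "b")
    table.mark_equivalent("c", "d")
    table.mark_equivalent("b", "c")
    print(table.classes())
```

With both tricks in place, each find is close to constant time in practice, which is why I suspect the per-line string handling matters more than this part.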