On Mon, 22 Apr 2013 21:19:23 -0400, Rodrick Brown wrote:

> I would like some feedback on possible solutions to make this script run
> faster.
> The system is pegged at 100% CPU and it takes a long time to complete.

Have you profiled the app to see where it is spending all its time?

What does "a long time" mean? For instance:

"It takes two hours to process a 15KB file" -- you have a problem.

"It takes 20 minutes to process a 15GB file" -- and why are you complaining?

Or somewhere in the middle...

But before profiling, I suggest you clean up the program. For example:

    if args.inputfile and os.path.exists(args.inputfile):

Don't do that. There really isn't any point in checking whether the input
file exists, since:

1) Just because it exists doesn't mean you can read it;

2) Just because you can read it doesn't mean it is a valid gzip file;

3) Just because it is a valid gzip file that you can read *now*, doesn't
   mean that it still will be in 10 milliseconds when you actually try to
   open the file.

A lot can happen in 10ms, or 1ms. The file might be deleted, or overwritten,
or have its permissions changed. Change that to:

    try:
        with gzip.open(args.inputfile) as datafile:
            for line in datafile:

and catch the exception if the file doesn't exist, or cannot be read. You
already do that, which just demonstrates that the call to os.path.exists is
a waste of effort.

Then look for wasted effort like this:

    line = line.replace('cdn.xxx', 'www.xxx')
    line = line.replace('cdn.xx', 'www.xx')

Surely the first line is redundant, since any 'cdn.xxx' would be caught and
correctly replaced by the second?

Also, you're searching the file system *for every line* in the input file.
Pull this outside of the loop and have it run once:

    if not os.path.exists(outdir):
        os.makedirs(outdir)

Likewise for the output file, which you currently open and close for every
line. It only needs to be opened and closed once.

If it comes down to micro-optimizations to shave a few microseconds off,
consider using string % formatting rather than the format method. But
really, if you find yourself shaving microseconds off something that runs
for ten minutes, you have to ask why you're bothering.

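Putting the cleanup suggestions above together, a minimal sketch might look like the following. It assumes Python 3 and is not the original script, which isn't shown here: the function name rewrite_log, the outname parameter and its default are hypothetical, and os.makedirs(..., exist_ok=True) is swapped in for the explicit os.path.exists check.

```python
import gzip
import os
import sys


def rewrite_log(inputfile, outdir, outname="rewritten.log"):
    """Copy a gzipped text file into outdir, rewriting 'cdn.xx' to 'www.xx'.

    The name rewrite_log and the outname parameter are hypothetical.
    Returns the number of lines written, or None if the input failed.
    """
    # Create the output directory once, not once per input line.
    os.makedirs(outdir, exist_ok=True)
    outpath = os.path.join(outdir, outname)
    count = 0
    try:
        # No os.path.exists() check: just try to open and handle failure.
        # "rt" yields str lines; gzip.open defaults to binary mode.
        # The output file is opened once, outside the loop.
        with gzip.open(inputfile, "rt") as datafile, \
                open(outpath, "w") as outfile:
            for line in datafile:
                # One replace covers both 'cdn.xx' and 'cdn.xxx'.
                outfile.write(line.replace("cdn.xx", "www.xx"))
                count += 1
    except (OSError, EOFError) as err:
        # OSError: missing or unreadable file, or not valid gzip data.
        # EOFError: gzip stream truncated before its end marker.
        print("cannot process %s: %s" % (inputfile, err), file=sys.stderr)
        return None
    return count
```

Called as rewrite_log(args.inputfile, outdir), this touches the file system for the directory and the output file once per run rather than once per line. To see where the remaining time goes, running the script under the standard profiler (python -m cProfile -s cumulative yourscript.py, where yourscript.py stands in for the real script name) is a reasonable first step.
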
--
Steven