percent faster than format()? (was: Re: optomizations)
On 23.04.2013 06:00, Steven D'Aprano wrote:
> If it comes down to micro-optimizations to shave a few microseconds
> off, consider using string % formatting rather than the format method.

Why? I don't see any obvious difference between the two...

Greetings!

Uli
--
http://mail.python.org/mailman/listinfo/python-list
Re: percent faster than format()? (was: Re: optomizations)
On Tue, Apr 23, 2013 at 9:46 AM, Ulrich Eckhardt
<ulrich.eckha...@dominolaser.com> wrote:
> On 23.04.2013 06:00, Steven D'Aprano wrote:
>> If it comes down to micro-optimizations to shave a few microseconds
>> off, consider using string % formatting rather than the format method.
>
> Why? I don't see any obvious difference between the two...

$ python -m timeit "a = '{0} {1} {2}'.format(1, 2, 42)"
100 loops, best of 3: 0.824 usec per loop
$ python -m timeit "a = '%s %s %s' % (1, 2, 42)"
1000 loops, best of 3: 0.0286 usec per loop

--
Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
stop html mail                | always bottom-post
http://asciiribbon.org        | http://caliburn.nl/topposting.html
Re: optomizations
On Tue, Apr 23, 2013 at 11:53 AM, Roy Smith <r...@panix.com> wrote:
> In article <mailman.944.1366680414.3114.python-l...@python.org>,
> Rodrick Brown <rodrick.br...@gmail.com> wrote:
>> I would like some feedback on possible solutions to make this script
>> run faster.
>
> If I had to guess, I would think this stuff:
>
>     line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
>     line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
>     line = line.replace('cdn.xxx', 'www.xxx')
>     line = line.replace('cdn.xxx', 'www.xxx')
>     line = line.replace('cdn.xx', 'www.xx')
>     siteurl = line.split()[6].split('/')[2]
>     line = re.sub(r'\bhttps?://%s\b' % siteurl, '', line, 1)
>
> You make 6 copies of every line. That's slow.

One of those is a regular expression substitution, which is also likely
to be a hot-spot. But definitely profile.

ChrisA
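One way to collapse that chain of replace() calls into a single pass over each line is a regex with alternation and a dict lookup. A sketch, reusing the quoted script's hostnames (the 'xxx' domains are the original poster's placeholders, not real hosts):

```python
import re

# Mapping taken from the chained replace() calls in the quoted script.
REPLACEMENTS = {
    'mediacdn.xxx.com': 'media.xxx.com',
    'staticcdn.xxx.co.uk': 'static.xxx.co.uk',
    'cdn.xx': 'www.xx',
}

# Longest keys first, so 'staticcdn.xxx.co.uk' wins over the shorter
# overlapping key 'cdn.xx'.
_pattern = re.compile('|'.join(
    re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True)))

def replace_all(line):
    # Scans the line once instead of once per replacement.
    return _pattern.sub(lambda m: REPLACEMENTS[m.group(0)], line)

print(replace_all('GET http://mediacdn.xxx.com/img.png'))
```

Whether this beats five str.replace() calls depends on the data, so as Roy says below: profile first.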
Re: percent faster than format()? (was: Re: optomizations)
On Tue, 23 Apr 2013 09:46:53 +0200, Ulrich Eckhardt wrote:

> On 23.04.2013 06:00, Steven D'Aprano wrote:
>> If it comes down to micro-optimizations to shave a few microseconds
>> off, consider using string % formatting rather than the format method.
>
> Why? I don't see any obvious difference between the two...

Possibly the state of the art has changed since then, but some years ago
% formatting was slightly faster than the format method. Let's try it
and see:

# Using Python 3.3.
py> from timeit import Timer
py> setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
py> t1 = Timer("'%s, %s and %s for breakfast' % (a, b, c)", setup)
py> t2 = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)
py> print(min(t1.repeat()))
0.8319804421626031
py> print(min(t2.repeat()))
1.2395259491167963

Looks like the format method is about 50% slower.

--
Steven
Re: percent faster than format()? (was: Re: optomizations)
On Wed, Apr 24, 2013 at 12:36 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> py> print(min(t1.repeat()))
> 0.8319804421626031
> py> print(min(t2.repeat()))
> 1.2395259491167963
>
> Looks like the format method is about 50% slower.

Figures on my hardware are (naturally) different, with a similar (but
slightly more pronounced) difference:

>>> sys.version
'3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)]'
>>> print(min(t1.repeat()))
1.4841416995735415
>>> print(min(t2.repeat()))
2.5459869899666074
>>> t3 = Timer("a + ', ' + b + ' and ' + c + ' for breakfast'", setup)
>>> print(min(t3.repeat()))
1.5707538248576327
>>> t4 = Timer("''.join([a, ', ', b, ' and ', c, ' for breakfast'])", setup)
>>> print(min(t4.repeat()))
1.5026834416105999

So on the face of it, format() is slower than everything else by a good
margin... until you note that repeat() is doing one million iterations,
so those figures are effectively in microseconds. Yeah, I think I can
handle a couple of microseconds.

ChrisA
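To spell out Chris's point: Timer.repeat() defaults to number=1000000 executions per repetition, so a total in seconds is numerically the per-call cost in microseconds. A sketch that makes the conversion explicit (with a smaller iteration count so it runs quickly):

```python
from timeit import Timer

setup = "a = 'spam'; b = 'ham'; c = 'eggs'"
t = Timer("'{}, {} and {} for breakfast'.format(a, b, c)", setup)

number = 100000  # executions per repetition (the default is 1000000)
best = min(t.repeat(repeat=3, number=number))  # total seconds for `number` calls
per_call_us = best * 1e6 / number  # microseconds per single call
print('{:.3f} usec per call'.format(per_call_us))
```

Taking the minimum of the repetitions is the usual practice: it is the measurement least disturbed by other activity on the machine.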
optomizations
I would like some feedback on possible solutions to make this script run
faster. The system is pegged at 100% CPU and it takes a long time to
complete.

#!/usr/bin/env python

import gzip
import re
import os
import sys
from datetime import datetime
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', dest='inputfile', type=str,
                        help='data file to parse')
    parser.add_argument('-o', dest='outputdir', type=str,
                        default=os.getcwd(), help='Output directory')
    args = parser.parse_args()

    if len(sys.argv[1:]) < 1:
        parser.print_usage()
        sys.exit(-1)

    print(args)

    if args.inputfile and os.path.exists(args.inputfile):
        try:
            with gzip.open(args.inputfile) as datafile:
                for line in datafile:
                    line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
                    line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
                    line = line.replace('cdn.xxx', 'www.xxx')
                    line = line.replace('cdn.xxx', 'www.xxx')
                    line = line.replace('cdn.xx', 'www.xx')
                    siteurl = line.split()[6].split('/')[2]
                    line = re.sub(r'\bhttps?://%s\b' % siteurl, '', line, 1)
                    (day, month, year, hour, minute, second) = \
                        (line.split()[3]).replace('[', '').replace(':', '/').split('/')
                    datelog = '{} {} {}'.format(month, day, year)
                    dateobj = datetime.strptime(datelog, '%b %d %Y')
                    outfile = '{}{}{}_combined.log'.format(
                        dateobj.year, dateobj.month, dateobj.day)
                    outdir = (args.outputdir + os.sep + siteurl)
                    if not os.path.exists(outdir):
                        os.makedirs(outdir)
                    with open(outdir + os.sep + outfile, 'w+') as outf:
                        outf.write(line)
        except IOError, err:
            sys.stderr.write("Error unable to read or extract inputfile: "
                             "{} {}\n".format(args.inputfile, err))
            sys.exit(-1)
Re: optomizations
On Tue, Apr 23, 2013 at 11:19 AM, Rodrick Brown <rodrick.br...@gmail.com> wrote:
>             with gzip.open(args.inputfile) as datafile:
>                 for line in datafile:
>                     ...
>                     outfile = '{}{}{}_combined.log'.format(
>                         dateobj.year, dateobj.month, dateobj.day)
>                     outdir = (args.outputdir + os.sep + siteurl)
>                     with open(outdir + os.sep + outfile, 'w+') as outf:
>                         outf.write(line)

You're opening files and closing them again for every line. This
wouldn't cause you to spin the CPU (more likely it'd thrash the hard
disk - unless you have an SSD), but it is certainly an optimization
target.

Can you know in advance what files you need? If not, I'd try something
like this:

outf = {}  # Might want a better name though.
...
    outfile = ...
    if outfile not in outf:
        os.makedirs(...)
        outf[outfile] = open(...)
    outf[outfile].write(line)
...
for f in outf.values():
    f.close()

Open files only as needed, close 'em all at the end.

ChrisA
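Fleshing that sketch out into something runnable (the class and method names here are illustrative, not from the thread; `exist_ok` needs Python 3.2+):

```python
import os

class OutputFiles:
    """Cache of open output files keyed by path, so each file is opened
    once and closed once instead of once per input line."""

    def __init__(self):
        self._files = {}

    def write(self, path, line):
        f = self._files.get(path)
        if f is None:
            # Create the directory and open the file only on first use.
            os.makedirs(os.path.dirname(path), exist_ok=True)
            f = self._files[path] = open(path, 'a')
        f.write(line)

    def close_all(self):
        for f in self._files.values():
            f.close()
        self._files.clear()
```

Note the files are opened in append mode; the original script's 'w+' would truncate the file each time it was reopened, which is almost certainly not what was intended.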
Re: optomizations
In article <mailman.944.1366680414.3114.python-l...@python.org>,
Rodrick Brown <rodrick.br...@gmail.com> wrote:
> I would like some feedback on possible solutions to make this script
> run faster.

If I had to guess, I would think this stuff:

    line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
    line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')
    line = line.replace('cdn.xxx', 'www.xxx')
    line = line.replace('cdn.xxx', 'www.xxx')
    line = line.replace('cdn.xx', 'www.xx')
    siteurl = line.split()[6].split('/')[2]
    line = re.sub(r'\bhttps?://%s\b' % siteurl, '', line, 1)

You make 6 copies of every line. That's slow.

But I'm also going to quote something I wrote here a couple of months back:

    I've been doing some log analysis. It's been taking a grovelingly
    long time, so I decided to fire up the profiler and see what's
    taking so long. I had a pretty good idea of where the ONLY TWO
    POSSIBLE hotspots might be (looking up IP addresses in the
    geolocation database, or producing some pretty pictures using
    matplotlib). It was just a matter of figuring out which it was.

    As with most attempts to out-guess the profiler, I was totally,
    absolutely, and embarrassingly wrong.

So, my real advice to you is to fire up the profiler and see what it says.
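For anyone following along, a minimal cProfile session looks something like this; the work() function here is just a stand-in for the real log-processing loop:

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for the code you actually want to profile.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Print the ten most expensive entries, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()
print(report)
```

For a whole script you can skip the code changes entirely and run `python -m cProfile yourscript.py` instead.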
Re: optomizations
On 23/04/2013 02:19, Rodrick Brown wrote:
> I would like some feedback on possible solutions to make this script
> run faster. The system is pegged at 100% CPU and it takes a long time
> to complete.
>
> [snip]
>
>                     line = line.replace('mediacdn.xxx.com', 'media.xxx.com')
>                     line = line.replace('staticcdn.xxx.co.uk', 'static.xxx.co.uk')

These next 2 lines are duplicates; the second will have no effect (I think!):

>                     line = line.replace('cdn.xxx', 'www.xxx')
>                     line = line.replace('cdn.xxx', 'www.xxx')

Won't the next line also do the work of the preceding 2 lines?

>                     line = line.replace('cdn.xx', 'www.xx')
>
> [snip]

I wonder whether it'll make a difference if you read a chunk at a time
(datafile.read(chunk_size) + datafile.readline() to ensure you have
complete lines), perform the replacements on it (so that you're working
on several lines in one go), and then split it into lines for further
processing.

Another thing you could try is caching the result of parsing the date,
using (month, day, year) as the key and outfile as the value in a dict.

A third thing you could try is not writing a file for every line
(doesn't the 'w+' mode truncate the file?), but saving the output for
each chunk (see first suggestion) and then writing the files afterwards,
at the end of the chunk.
Re: optomizations
On Mon, Apr 22, 2013 at 6:53 PM, Roy Smith <r...@panix.com> wrote:
> So, my real advice to you is to fire up the profiler and see what it
> says.

I agree. Fire up a line-oriented profiler and only then start trying to
improve the hot spots.
Re: optomizations
On Mon, 22 Apr 2013 21:19:23 -0400, Rodrick Brown wrote:
> I would like some feedback on possible solutions to make this script
> run faster. The system is pegged at 100% CPU and it takes a long time
> to complete.

Have you profiled the app to see where it is spending all its time?

What does "a long time" mean? For instance:

It takes two hours to process a 15KB file -- you have a problem.

It takes 20 minutes to process a 15GB file -- and why are you complaining?

Or somewhere in the middle...

But before profiling, I suggest you clean up the program. For example:

    if args.inputfile and os.path.exists(args.inputfile):

Don't do that. There really isn't any point in checking whether the
input file exists, since:

1) Just because it exists doesn't mean you can read it;

2) Just because you can read it doesn't mean it is a valid gzip file;

3) Just because it is a valid gzip file that you can read *now*, doesn't
mean that it still will be in 10 milliseconds when you actually try to
open the file. A lot can happen in 10ms, or 1ms. The file might be
deleted, or overwritten, or permissions changed.

Change that to:

    try:
        with gzip.open(args.inputfile) as datafile:
            for line in datafile:
                ...

and catch the exception if the file doesn't exist, or cannot be read.
Which you already do, which just demonstrates that the call to
os.path.exists is a waste of effort.

Then look for wasted effort like this:

    line = line.replace('cdn.xxx', 'www.xxx')
    line = line.replace('cdn.xx', 'www.xx')

Surely the first line is redundant, since it would be correctly caught
and replaced by the second?

Also, you're searching the file system *for every line* in the input
file. Pull this outside of the loop and have it run once:

    if not os.path.exists(outdir):
        os.makedirs(outdir)

Likewise for opening and closing the output file, which you currently
do for every line. It only needs to be opened and closed once.

If it comes down to micro-optimizations to shave a few microseconds
off, consider using string % formatting rather than the format method.
But really, if you find yourself shaving microseconds off something
that runs for ten minutes, you have to ask why you're bothering.

--
Steven
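The "just try it and catch the exception" advice, as a self-contained sketch; process() and its handle callback are hypothetical names, not from the original script:

```python
import gzip
import sys

def process(inputfile, handle):
    """Open a gzipped log and pass each line to handle(). EAFP: no racy
    os.path.exists() pre-check. Returns 0 on success, 1 on failure."""
    try:
        with gzip.open(inputfile, 'rt') as datafile:
            for line in datafile:
                handle(line)
    except IOError as err:  # on Python 3 this also catches FileNotFoundError
        sys.stderr.write('unable to read {}: {}\n'.format(inputfile, err))
        return 1
    return 0
```

Because the error handling wraps the open *and* the read loop, a file that disappears, is unreadable, or turns out not to be valid gzip is all reported the same way, with no window for a race.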
Re: optomizations
On Tue, Apr 23, 2013 at 2:00 PM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> Also, you're searching the file system *for every line* in the input
> file. Pull this outside of the loop and have it run once:
>
>     if not os.path.exists(outdir):
>         os.makedirs(outdir)
>
> Likewise for opening and closing the output file, which you currently
> do for every line. It only needs to be opened and closed once.

The outdir depends on the line, though. Hence my suggestion to retain
the open files in a dictionary.

ChrisA
Re: optomizations
On Apr 22, 2013, at 11:18 PM, Dan Stromberg <drsali...@gmail.com> wrote:
> On Mon, Apr 22, 2013 at 6:53 PM, Roy Smith <r...@panix.com> wrote:
>> So, my real advice to you is to fire up the profiler and see what it
>> says.
>
> I agree. Fire up a line-oriented profiler and only then start trying
> to improve the hot spots.

Got a doc or URL? I have no experience working with python profilers.
Re: optomizations
On Tue, 23 Apr 2013 00:20:59 -0400, Rodrick Brown wrote:
> Got a doc or URL? I have no experience working with python profilers.

https://duckduckgo.com/html/?q=python%20profiler

This is also good: http://pymotw.com/2/profile/

--
Steven