Lie Ryan wrote:
> On Tue, 16 Dec 2008 12:07:14 -0300, Federico Moreira wrote:
>
>> Hi all,
>>
>> I'm parsing a 4.1GB Apache log to get stats on how many times each IP
>> requests something from the server.
>>
>> The first design of the algorithm was
>>
>> for line in fileinput.input(sys.argv[1:]):
>>     ip = line.split()[0]
>>     if match_counter.has_key(ip):
>>         match_counter[ip] += 1
>>     else:
>>         match_counter[ip] = 1
>>
>> And it took 3 min 58 sec to give me the stats.
>>
>> Then I tried a generator solution like
>>
>> def generateit():
>>     for line in fileinput.input(sys.argv[1:]):
>>         yield line.split()[0]
>>
>> for ip in generateit():
>>     ...the same if statement
>>
>> Instead of being faster, it took 4 min 20 sec.
>>
>> Should I leave fileinput behind?
>> Am I using generators with the wrong approach?
>>
>
> What's fileinput? A file-like object (unlikely)? Also, what's
> fileinput.input? I guess the reason why you don't see much difference
> (and it is in fact slower) lies in what fileinput.input does.
>
Fileinput is a standard module distributed with Python:

From the manual:

11.2 fileinput -- Iterate over lines from multiple input streams

This module implements a helper class and functions to quickly write a
loop over standard input or a list of files.

The typical use is:

    import fileinput
    for line in fileinput.input():
        process(line)

...

> Generators excel at processing huge data because they don't have to
> create huge intermediate lists that eat up memory. Given infinite
> memory, a generator solution is almost always slower than a
> straight-up solution using lists. However, in real life we don't have
> infinite memory: hogging memory with a huge intermediate list would
> make the system start swapping, and swapping is very slow and a big
> hit to performance. This is how a generator can be faster than a list.
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
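
For comparison, here is a minimal sketch of the same per-IP count that
iterates over a plain file object instead of fileinput. A file object
yields one line at a time, so no intermediate list of lines is built.
It assumes Python 3, and the names count_ips and count_ips_in_files are
hypothetical helpers, not anything from the thread. It has not been
timed against the 4.1GB log, so treat it as something to measure, not
as a guaranteed speedup.

```python
import collections


def count_ips(lines):
    """Count the first whitespace-separated field of each line.

    `lines` can be any iterable of strings, for example an open file
    object, which is read lazily one line at a time.
    """
    counts = collections.Counter()
    for line in lines:
        fields = line.split(None, 1)  # split off only the first field
        if fields:                    # skip blank lines
            counts[fields[0]] += 1
    return counts


def count_ips_in_files(paths):
    """Hypothetical driver: merge the counts from several log files."""
    total = collections.Counter()
    for path in paths:
        # errors="replace" is an assumption: it keeps odd bytes in the
        # log from aborting the whole run.
        with open(path, errors="replace") as handle:
            total.update(count_ips(handle))
    return total
```

Calling count_ips_in_files(sys.argv[1:]) (after importing sys) would
mirror the original script, and total.most_common(10) would list the
ten busiest addresses.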