> Federico Moreira wrote:
> > Hi all,
> >
> > I'm parsing a 4.1GB Apache log to get stats about how many times an IP
> > requested something from the server.
> >
> > The first design of the algorithm was:
> >
> >     for line in fileinput.input(sys.argv[1:]):
> >         ip = line.split()[0]
> >         if match_counter.has_key(ip):
> >             match_counter[ip] += 1
> >         else:
> >             match_counter[ip] = 1
. . .
> > Should I leave fileinput behind?
Yes. fileinput is slow because it does a lot more than just read files.

> > Am I using generators with the wrong approach?

No need for a generator here. The time is being lost in fileinput, split,
and the counting code. Try this instead:

    import collections
    import sys

    match_counter = collections.defaultdict(int)
    for filename in sys.argv[1:]:
        for line in open(filename):
            ip, sep, rest = line.partition(' ')
            match_counter[ip] += 1

If you're on *nix, there's a fast command-line approach:

    cut -d' ' -f1 filelist | sort | uniq -c

--
http://mail.python.org/mailman/listinfo/python-list
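[Editor's note: a minimal self-contained sketch of the defaultdict counting
approach from the reply above, using hypothetical in-memory log lines in
place of the real 4.1GB file, and adding a sorted "top requesters" report
that the original thread did not include:]

    import collections

    # Hypothetical sample lines standing in for the Apache access log;
    # only the leading IP field matters for the count.
    lines = [
        '10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET / HTTP/1.1" 200 512',
        '10.0.0.2 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 200 64',
        '10.0.0.1 - - [01/Jan/2024:00:00:02 +0000] "GET /b HTTP/1.1" 404 0',
    ]

    match_counter = collections.defaultdict(int)
    for line in lines:
        # partition() splits only at the first space, which is cheaper
        # than split() when you need just the first field.
        ip, _, _ = line.partition(' ')
        match_counter[ip] += 1

    # Report IPs ordered by hit count, most frequent first.
    for ip, hits in sorted(match_counter.items(),
                           key=lambda kv: kv[1], reverse=True):
        print(ip, hits)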