Federico Moreira wrote:
Hi all,

I'm parsing a 4.1 GB Apache log to get stats on how many times each IP requests something from the server.

The first design of the algorithm was:

for line in fileinput.input(sys.argv[1:]):
    ip = line.split()[0]
    if match_counter.has_key(ip):
        match_counter[ip] += 1
    else:
        match_counter[ip] = 1

And it took 3 min 58 sec to give me the stats.

Then I tried a generator solution like:

def generateit():
    for line in fileinput.input(sys.argv[1:]):
        yield line.split()[0]

for ip in generateit():
    ...the same if statement

Instead of being faster, it took 4 min 20 sec.

Should I leave fileinput behind?
Am I using generators with the wrong approach?

Your first design is already simple to understand, so I think that using a generator isn't necessary (and probably isn't worth the cost!).

You might want to try defaultdict instead of dict to see whether that would be faster:

import fileinput
import sys
from collections import defaultdict

match_counter = defaultdict(int)  # missing keys start at 0
for line in fileinput.input(sys.argv[1:]):
    ip = line.split()[0]
    match_counter[ip] += 1
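
To try the defaultdict approach on a few lines before running it over the whole log, here is a minimal sketch that counts over any iterable of lines instead of reading through fileinput. The function name count_ips and the sample log lines are made up for illustration. It also skips blank lines, since line.split()[0] raises an IndexError on an empty line.

```python
from collections import defaultdict


def count_ips(lines):
    """Count how often each first field (the client IP) appears."""
    counts = defaultdict(int)  # missing keys start at 0
    for line in lines:
        fields = line.split(None, 1)  # split off only the first field
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


# Made-up sample lines in Apache access-log style
sample = [
    '10.0.0.1 - - [01/Jan/2008:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2008:00:00:02 +0000] "GET /a HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Jan/2008:00:00:03 +0000] "GET /b HTTP/1.1" 404 0',
    '',
]

counts = count_ips(sample)
# Print the IPs from most to least frequent
for ip, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(ip, n)
```

Passing fileinput.input(sys.argv[1:]) or an open file object as lines should work the same way. Whether split(None, 1) saves any noticeable time on a 4.1 GB log is something to measure rather than assume.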

--
http://mail.python.org/mailman/listinfo/python-list
