Federico Moreira wrote:
Hi all,
I'm parsing a 4.1 GB Apache log to get stats on how many times each IP
requests something from the server.
The first design of the algorithm was:
    for line in fileinput.input(sys.argv[1:]):
        ip = line.split()[0]
        if match_counter.has_key(ip):
            match_counter[ip] += 1
        else:
            match_counter[ip] = 1
It took 3 min 58 s to give me the stats.
Then I tried a generator solution like:
    def generateit():
        for line in fileinput.input(sys.argv[1:]):
            yield line.split()[0]

    for ip in generateit():
        # ...the same if statement as above
Instead of being faster, it took 4 min 20 s.
Should I leave fileinput behind?
Am I using generators with the wrong approach?
Your first design is already simple to understand, so I think that using
a generator isn't necessary (and probably isn't worth the cost!).
You might want to try defaultdict instead of dict to see whether that
would be faster:
    import fileinput
    import sys
    from collections import defaultdict

    match_counter = defaultdict(int)
    for line in fileinput.input(sys.argv[1:]):
        ip = line.split()[0]
        match_counter[ip] += 1
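If you want to check the counting logic on a few lines before running it against the whole log, something like the following might help. It is only a sketch: the function name count_ips, the sample log lines, and the two small changes (splitting only once, and skipping blank lines) are my own assumptions, not part of your code.

```python
from collections import defaultdict


def count_ips(lines):
    """Count how often each first whitespace-separated field occurs.

    `lines` can be any iterable of strings: a list, an open file,
    or fileinput.input(...). The name count_ips is hypothetical.
    """
    counts = defaultdict(int)
    for line in lines:
        # Assumption: only the first field is needed, so split at most once.
        fields = line.split(None, 1)
        if fields:  # a blank line yields an empty list; skip it
            counts[fields[0]] += 1
    return counts


# Made-up sample lines in roughly the Apache common log layout.
sample = [
    '10.0.0.1 - - [01/Jan/2008:00:00:01 +0000] "GET / HTTP/1.0" 200 512\n',
    '10.0.0.2 - - [01/Jan/2008:00:00:02 +0000] "GET /a HTTP/1.0" 200 128\n',
    '10.0.0.1 - - [01/Jan/2008:00:00:03 +0000] "GET /b HTTP/1.0" 404 64\n',
    '\n',
]
sample_counts = count_ips(sample)
```

For the real log you could pass fileinput.input(sys.argv[1:]) or an open file object instead of the list. Whether the single split makes a measurable difference on your data is something to time rather than assume.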