On Tue, 16 Dec 2008 12:07:14 -0300, Federico Moreira wrote:

> Hi all,
> 
> I'm parsing a 4.1 GB Apache log to get stats on how many times an IP
> requests something from the server.
> 
> The first design of the algorithm was
> 
> import fileinput
> import sys
> 
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split()[0]
>     if match_counter.has_key(ip):
>         match_counter[ip] += 1
>     else:
>         match_counter[ip] = 1
> 
> And it took 3 min 58 sec to give me the stats.
> 
> Then I tried a generator solution like
> 
> def generateit():
>     for line in fileinput.input(sys.argv[1:]):
>         yield line.split()[0]
> 
> for ip in generateit():
>     ...the same if statement
> 
> Instead of being faster, it took 4 min 20 sec.
> 
> Should I leave fileinput behind?
> Am I using generators with the wrong approach?

What's fileinput? A file-like object (unlikely)? And what does 
fileinput.input return? I'd guess the reason you don't see much 
difference (and in fact see a slowdown) lies in what fileinput.input does.
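
One quick experiment worth trying (a rough sketch, untested on your data; 
reading each file with a plain open() is my assumption, not your setup): 
bypass fileinput entirely. While you're at it, dict.get() folds your 
has_key test and the two lookups into one step.

import sys

match_counter = {}
for path in sys.argv[1:]:
    f = open(path)
    for line in f:
        # Plain file iteration, no fileinput layer in between.
        ip = line.split()[0]
        # get() replaces the has_key check plus second lookup.
        match_counter[ip] = match_counter.get(ip, 0) + 1
    f.close()

If this version is noticeably faster, the overhead lives in 
fileinput.input rather than in your counting loop.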

Generators excel at processing huge data sets because they don't have to 
build huge intermediate lists that eat up memory. Given infinite memory, 
a generator solution is almost always slower than a straightforward 
solution using lists. In real life, though, we don't have infinite 
memory: hogging it with a huge intermediate list makes the system start 
swapping, and swapping is very slow and a big hit to performance. That is 
how a generator can end up faster than a list.

Note that your original for loop was already consuming input one line at 
a time rather than building a list, so wrapping it in another generator 
only adds an extra layer of function calls per line, which would explain 
the slowdown you measured.
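
To make the memory point concrete, here is a minimal sketch (the 
workload is made up): the list comprehension materializes all ten 
million squares before sum() sees the first one, while the generator 
expression hands them over one at a time.

# List version: builds the whole 10-million-element list up front;
# memory use grows with the input size.
total = sum([i * i for i in xrange(10**7)])

# Generator version: produces one square at a time;
# memory use stays constant regardless of input size.
total = sum(i * i for i in xrange(10**7))

With data small enough to fit in RAM, the list version often wins on raw 
speed; the generator wins as soon as the list would push the machine 
into swap.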

