Re: Generator slower than iterator?

2008-12-16 Thread MRAB

Federico Moreira wrote:

Hi all,

I'm parsing a 4.1GB Apache log to gather stats about how many times an IP 
requests something from the server.


The first design of the algorithm was

for line in fileinput.input(sys.argv[1:]):
    ip = line.split()[0]
    if match_counter.has_key(ip):
        match_counter[ip] += 1
    else:
        match_counter[ip] = 1

And it took 3 min 58 sec to give me the stats

Then I tried a generator solution like

def generateit():
    for line in fileinput.input(sys.argv[1:]):
        yield line.split()[0]

for ip in generateit():
    ...the same if statement

Instead of being faster it took 4 min 20 sec

Should I leave fileinput behind?
Am I using generators with the wrong approach?

Your first design is already simple to understand, so I think that using 
a generator isn't necessary (and probably isn't worth the cost!).


You might want to try defaultdict instead of dict to see whether that 
would be faster:


from collections import defaultdict

match_counter = defaultdict(int)
for line in fileinput.input(sys.argv[1:]):
    ip = line.split()[0]
    match_counter[ip] += 1

--
http://mail.python.org/mailman/listinfo/python-list


Re: Generator slower than iterator?

2008-12-16 Thread Lie Ryan
On Tue, 16 Dec 2008 12:07:14 -0300, Federico Moreira wrote:

> Hi all,
> 
> Im parsing a 4.1GB apache log to have stats about how many times an ip
> request something from the server.
> 
> The first design of the algorithm was
> 
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split()[0]
>     if match_counter.has_key(ip):
>         match_counter[ip] += 1
>     else:
>         match_counter[ip] = 1
> 
> And it took 3min 58 seg to give me the stats
> 
> Then i tried a generator solution like
> 
> def generateit():
>     for line in fileinput.input(sys.argv[1:]):
>         yield line.split()[0]
>
> for ip in generateit():
>     ...the same if statement
> 
> Instead of being faster it took 4 min 20 seg
> 
> Should i leave fileinput behind?
> Am i using generators with the wrong aproach?

What's fileinput? A file-like object (unlikely)? Also, what's 
fileinput.input? I guess the reason why you don't see much difference 
(and in fact see a slowdown) lies in what fileinput.input does.

Generators excel at processing huge data sets because they don't have to 
create huge intermediate lists, which eat up memory. Given infinite 
memory, a generator solution is almost always slower than a straight-up 
solution using lists. In real life, however, we don't have infinite 
memory: hogging memory with a huge intermediate list can make the system 
start swapping, and swapping is very slow and a big hit to performance. 
That is how a generator can end up faster than a list.
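To make the memory point concrete, here is a minimal sketch (the 
fake_log generator is a made-up stand-in for a huge file, and the syntax 
is modern Python 3): summing over a generator keeps only one item alive 
at a time, while the list version materializes every item first.

```python
# Stand-in for a huge log: 1000 lines of 10 characters each.
# (Hypothetical data; a real log would come from a file.)
def fake_log():
    for _ in range(1000):
        yield "x" * 10

# Generator expression: lines are consumed one at a time, so memory
# use stays constant no matter how long the "log" is.
total_gen = sum(len(line) for line in fake_log())

# List comprehension: the whole list of lengths exists in memory
# before sum() ever runs.
lengths = [len(line) for line in fake_log()]
total_list = sum(lengths)

assert total_gen == total_list == 10000
```

With only 1000 short lines both run instantly; the difference matters 
when the intermediate list would be large enough to push the machine 
into swap.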




Re: Generator slower than iterator?

2008-12-16 Thread Lie Ryan
On Tue, 16 Dec 2008 12:07:14 -0300, Federico Moreira wrote:

> Hi all,
> 
> Im parsing a 4.1GB apache log to have stats about how many times an ip
> request something from the server.
> 
> The first design of the algorithm was
> 
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split()[0]
>     if match_counter.has_key(ip):
>         match_counter[ip] += 1
>     else:
>         match_counter[ip] = 1

nitpick:
dict.has_key is usually replaced with
if ip in match_counter: ...

Also, after investigating your code further, I see that you've used a 
generator unnecessarily: the first version is simpler, and the generator 
doesn't avoid building any huge intermediate list here. You won't get 
any performance improvement this way; instead you take a performance hit 
from the extra function-call overhead and name lookups.
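A rough sketch of that overhead (Python 3 syntax; the function names are 
made up for illustration): the two functions below compute the same sum, 
but the generator version pays for suspending and resuming the generator 
frame on every item while saving no memory.

```python
import timeit

def direct():
    # Plain loop: no generator in the middle.
    total = 0
    for i in range(1000):
        total += i
    return total

def numbers():
    # Generator that merely forwards the same values.
    for i in range(1000):
        yield i

def via_generator():
    # Same computation, routed through the generator.
    total = 0
    for i in numbers():
        total += i
    return total

assert direct() == via_generator() == 499500

# Timings vary by machine; the generator version is typically the
# slower of the two here.
print(timeit.timeit(direct, number=1000))
print(timeit.timeit(via_generator, number=1000))
```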



Re: Generator slower than iterator?

2008-12-16 Thread Gary Herron
Lie Ryan wrote:
> On Tue, 16 Dec 2008 12:07:14 -0300, Federico Moreira wrote:
>
>   
>> Hi all,
>>
>> Im parsing a 4.1GB apache log to have stats about how many times an ip
>> request something from the server.
>>
>> The first design of the algorithm was
>>
>> for line in fileinput.input(sys.argv[1:]):
>>     ip = line.split()[0]
>>     if match_counter.has_key(ip):
>>         match_counter[ip] += 1
>>     else:
>>         match_counter[ip] = 1
>>
>> And it took 3min 58 seg to give me the stats
>>
>> Then i tried a generator solution like
>>
>> def generateit():
>>     for line in fileinput.input(sys.argv[1:]):
>>         yield line.split()[0]
>>
>> for ip in generateit():
>>     ...the same if statement
>>
>> Instead of being faster it took 4 min 20 seg
>>
>> Should i leave fileinput behind?
>> Am i using generators with the wrong aproach?
>> 
>
> What's fileinput? A file-like object (unlikely)? Also, what's 
> fileinput.input? I guess the reason why you don't see much difference 
> (and is in fact slower) lies in what fileinput.input does.
>
>   

Fileinput is a standard module distributed with Python:


From the manual:

11.2 fileinput -- Iterate over lines from multiple input streams

This module implements a helper class and functions to quickly write a
loop over standard input or a list of files.

The typical use is:

import fileinput
for line in fileinput.input():
    process(line)

 ...


> Generators excel at processing huge data sets because they don't have 
> to create huge intermediate lists, which eat up memory. Given infinite 
> memory, a generator solution is almost always slower than a 
> straight-up solution using lists. In real life, however, we don't have 
> infinite memory: hogging memory with a huge intermediate list can make 
> the system start swapping, and swapping is very slow and a big hit to 
> performance. That is how a generator can end up faster than a list.
>
>



Re: Generator slower than iterator?

2008-12-16 Thread bearophileHUGS
MRAB:
> from collections import defaultdict
> match_counter = defaultdict(int)
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split()[0]
>     match_counter[ip] += 1

This can be a little faster still:

match_counter = defaultdict(int)
for line in fileinput.input(sys.argv[1:]):
    ip = line.split(None, 1)[0]
    match_counter[ip] += 1

Bye,
bearophile


Re: Generator slower than iterator?

2008-12-16 Thread Federico Moreira
The defaultdict option looks faster than the standard dict (approx. 20 secs faster).

Now i have:

#
import fileinput
import sys
from collections import defaultdict

match_counter = defaultdict(int)

for line in fileinput.input(sys.argv[1:]):
    match_counter[line.split()[0]] += 1
#

About generators, now I get the idea of when to use them and when not.

Thanks MRAB, Lie and Gary.
Now it's not only faster but also cleaner and easier to understand =)


Re: Generator slower than iterator?

2008-12-16 Thread rdmurray
Quoth Lie Ryan :
> On Tue, 16 Dec 2008 12:07:14 -0300, Federico Moreira wrote:
> 
> > Hi all,
> > 
> > Im parsing a 4.1GB apache log to have stats about how many times an ip
> > request something from the server.
> > 
> > The first design of the algorithm was
> > 
> > for line in fileinput.input(sys.argv[1:]):
> >     ip = line.split()[0]
> >     if match_counter.has_key(ip):
> >         match_counter[ip] += 1
> >     else:
> >         match_counter[ip] = 1
> 
> nitpick:
> dict.has_key is usually replaced with 
> if ip in match_counter: ...

I'd say it's just slightly more than a nitpick nowadays:  since Python
3.0 does not support has_key, it's time to get used to not using it :)
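For anyone following along, a tiny sketch of the replacement spelling 
(the IP address is made up); the `in` operator works in both Python 2 
and 3, while has_key is Python 2 only:

```python
match_counter = {"10.0.0.1": 3}

# Python 2 only (gone in 3.0): match_counter.has_key("10.0.0.1")
# Works everywhere:
if "10.0.0.1" in match_counter:
    match_counter["10.0.0.1"] += 1

assert match_counter["10.0.0.1"] == 4
```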

--RDM



Re: Generator slower than iterator?

2008-12-16 Thread Federico Moreira
2008/12/16 

> Python 3.0 does not support has_key, it's time to get used to not using it
> :)
>

Good to know

line.split(None, 1)[0] really speeds up the process

Thanks again.


Re: Generator slower than iterator?

2008-12-16 Thread Arnaud Delobelle
bearophileh...@lycos.com writes:

> This can be a little faster still:
>
> match_counter = defaultdict(int)
> for line in fileinput.input(sys.argv[1:]):
>     ip = line.split(None, 1)[0]
>     match_counter[ip] += 1
>
> Bye,
> bearophile

Or maybe (untested):

match_counter = defaultdict(int)
for line in fileinput.input(sys.argv[1:]):
    ip = line[:line.index(' ')]
    match_counter[ip] += 1

Or even (untested, I've never tried this ;):

from itertools import count
match_counter = defaultdict(count)
for line in fileinput.input(sys.argv[1:]):
    ip = line[:line.index(' ')]
    match_counter[ip].next()

match_total = dict((key, val()) for key, val in match_counter.iteritems())

Obviously the variable 'ip' could be removed in all cases.

-- 
Arnaud


Re: Generator slower than iterator?

2008-12-16 Thread Arnaud Delobelle
Arnaud Delobelle  writes:

> match_total = dict((key, val()) for key, val in match_counter.iteritems())

Sorry I meant

match_total = dict((key, val.next())
                   for key, val in match_counter.iteritems())

-- 
Arnaud


Re: Generator slower than iterator?

2008-12-19 Thread Raymond Hettinger
> Federico Moreira wrote:
> > Hi all,
>
> > Im parsing a 4.1GB apache log to have stats about how many times an ip
> > request something from the server.
>
> > The first design of the algorithm was
>
> > for line in fileinput.input(sys.argv[1:]):
> >     ip = line.split()[0]
> >     if match_counter.has_key(ip):
> >         match_counter[ip] += 1
> >     else:
> >         match_counter[ip] = 1
 . . .
> > Should i leave fileinput behind?

Yes.  fileinput is slow because it does a lot more than just read
files.

> > Am i using generators with the wrong aproach?

No need for a generator here.  The time is being lost with fileinput,
split, and the counting code.  Try this instead:

match_counter = collections.defaultdict(int)
for filename in sys.argv[1:]:
    for line in open(filename):
        ip, sep, rest = line.partition(' ')
        match_counter[ip] += 1

If you're on *nix, there's a fast command line approach:

cut -d' ' -f1 filelist | sort | uniq -c



Re: Generator slower than iterator?

2008-12-19 Thread Federico Moreira
Great, 2min 34 secs with the open method =)

but why?

ip, sep, rest = line.partition(' ')
match_counter[ip] += 1

instead of

match_counter[line.strip()[0]] += 1

strip really takes more time than partition?

I'm having the same results with both of them right now.


Re: Generator slower than iterator?

2008-12-19 Thread MRAB

Federico Moreira wrote:

Great, 2min 34 secs with the open method =)

but why?

ip, sep, rest = line.partition(' ')
match_counter[ip] += 1

instead of

match_counter[line.strip()[0]] += 1

strip really takes more time than partition?

I'm having the same results with both of them right now.


I think you meant split(), not strip().

split() might split the string into many parts, but you want only the 
first part (at most 1 split), so the extra splits are unnecessary and 
waste time. split(None, 1) and partition(' ') do at most 1 split, so 
they don't do unnecessary work.
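A small demonstration of the difference on a made-up log line (Python 3 
syntax):

```python
line = "10.0.0.1 - - [16/Dec/2008] GET /index.html"

# split() with no arguments breaks the line at every run of
# whitespace, building a list of every field...
all_parts = line.split()

# ...while split(None, 1) and partition(' ') stop after the first
# field, skipping the wasted work.
first_by_split = line.split(None, 1)[0]
first_by_partition = line.partition(' ')[0]

assert first_by_split == first_by_partition == "10.0.0.1"
assert len(all_parts) == 6
```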



Re: Generator slower than iterator?

2008-12-19 Thread Federico Moreira
Yep i meant split sorry.

Thanks for the answer!


Re: Generator slower than iterator?

2008-12-19 Thread Arnaud Delobelle
MRAB  writes:

> Federico Moreira wrote:
>> Great, 2min 34 secs with the open method =)
>>
>> but why?
>>
>> ip, sep, rest = line.partition(' ')
>> match_counter[ip] += 1
>>
>> instead of
>>
>> match_counter[line.strip()[0]] += 1
>>
>> strip really takes more time than partition?
>>
>> I'm having the same results with both of them right now.
>>
> I think you meant split(), not strip().
>
> split() might split the string into many parts, but you want only the
> first part (at most 1 split), so the extra splits are unnecessary and
> waste time. split(None, 1) and partition(' ') do at most 1 split, so
> they don't do unnecessary work.

Here's a very quick test:

>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a.partition(' ')[0]").timeit(500)
2.9831948280334473
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a[:a.find(' ')]").timeit(500)
3.0122888088226318
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a.split(None, 1)[0]").timeit(500)
4.5993850231170654
>>> timeit.Timer("a='fsdjflksjdf as fsd jkfl sdj'; b=a.split()[0]").timeit(500)
5.2067980766296387

So .split(None, 1) is quite a bit slower than .partition(' ') apparently.

-- 
Arnaud


Re: Generator slower than iterator?

2008-12-20 Thread Federico Moreira
Wow, thanks again =)