On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:

> I would be grateful for any advice people can offer on the fastest way
> to count items in a sub-sequence of a large list.
> 
> I have a list of boolean values that can contain many hundreds of
> millions of elements for which I want to count the number of True values
> in a sub-sequence, one from the start up to some value (say hi).
> 
> I am currently using:
> 
>    sieve[:hi].count(True)
> 
> but I believe this may be costly because it copies a possibly large part
> of the sieve.

Have you timed it? Because Python is a high-level language, it is rarely 
obvious what code will be fast. Yes, sieve[:hi] will copy the first hi 
entries, but that's likely to be fast, basically just a memcpy, unless 
sieve is huge and memory is short. In other words, unless your sieve is 
so huge that the operating system cannot find enough memory for it, 
making a copy is likely to be relatively insignificant.
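
You can check that claim directly by timing the bare slice copy against 
the slice-plus-count (a quick sketch; the sizes here are illustrative and 
the numbers will vary by machine):

```python
# Time the slice copy on its own versus slicing and counting.
# Since sieve[:hi].count(True) must make the copy first, the bare
# copy should always come out as the smaller number.
from timeit import Timer

setup = "sieve = [True, False] * (5 * 10**5); hi = 7 * 10**5"

copy_t = min(Timer("sieve[:hi]", setup).repeat(number=10))
count_t = min(Timer("sieve[:hi].count(True)", setup).repeat(number=10))
print(copy_t, count_t)  # the copy alone is the smaller of the two
```
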

I've just tried seven different techniques to "optimize" this, and the 
simplest, most obvious technique is by far the fastest. Here are the 
seven different code snippets I measured, with results:


sieve[:hi].count(True)
sum(sieve[:hi])
sum(islice(sieve, hi))
sum(x for x in islice(sieve, hi) if x)
sum(x for x in islice(sieve, hi) if x is True)
sum(1 for x in islice(sieve, hi) if x is True)
len(list(filter(None, islice(sieve, hi))))


Here's the code I used to time them. Just copy and paste into an 
interactive interpreter:

=== cut ===

import random
sieve = [random.random() < 0.5 for i in range(10**7)]

from timeit import Timer
setup = """from __main__ import sieve
from itertools import islice
hi = 7*10**6
"""

t1 = Timer("sieve[:hi].count(True)", setup)
t2 = Timer("sum(sieve[:hi])", setup)
t3 = Timer("sum(islice(sieve, hi))", setup)
t4 = Timer("sum(x for x in islice(sieve, hi) if x)", setup)
t5 = Timer("sum(x for x in islice(sieve, hi) if x is True)", setup)
t6 = Timer("sum(1 for x in islice(sieve, hi) if x is True)", setup)
t7 = Timer("len(list(filter(None, islice(sieve, hi))))", setup)

for t in (t1, t2, t3, t4, t5, t6, t7):
    print( min(t.repeat(number=10)) )

=== cut ===


On my computer, using Python 3.3, here are the timing results I get:

2.3714727330952883
7.96061935601756
7.230580328032374
10.080201900098473
11.544118068180978
9.216834562830627
3.499635103158653


Times shown are in seconds, and are for the best of three trials, each 
trial having 10 repetitions of the code being tested.

As you can see, the clever tricks using sum are horrible pessimisations; 
the only thing that comes close to the obvious solution is the version 
using filter.

Although I have only tested a list with ten million items, not hundreds 
of millions, I don't expect that the results will be significantly 
different if you use a larger list, unless you are very short of memory.

[...]
> Can I assume that one possible solution would be to sub-class list and
> create a C based extension to provide list.count(value, limit)?

Of course. But don't optimize this until you know that you *need* to 
optimize it. Is it really a bottleneck in your code? There's no point in 
saving the 0.1 second it takes to copy the list if it takes 2 seconds to 
count the items regardless.


> Are there any other solutions that will avoid copying a large part of
> the list?

Yes, but they're slower.
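
For example, one way to cap the extra memory is to count in fixed-size 
chunks, so that only a small slice ever exists at a time. A minimal 
sketch (the helper name and chunk size are arbitrary choices of mine):

```python
def count_true(seq, hi, chunk=10**5):
    """Count True values in seq[:hi] without materialising seq[:hi] at once."""
    total = 0
    for start in range(0, hi, chunk):
        # Each slice holds at most `chunk` items, so peak extra memory
        # stays bounded no matter how large hi is.
        total += seq[start:min(start + chunk, hi)].count(True)
    return total

sieve = [True, False, True, True] * 1000
print(count_true(sieve, 10))  # same result as sieve[:10].count(True) -> 7
```

This trades one big copy for many small ones; it is unlikely to beat the 
plain slice-and-count on speed, it just bounds the memory overhead.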

Perhaps a better solution might be to avoid counting anything. If you can 
keep a counter, and each time you add a value to the list you update the 
counter, then getting the number of True values will be instantaneous.
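
A minimal sketch of that idea, assuming the list only ever grows by 
appending (the class name and API here are hypothetical, not a drop-in 
list replacement): keep a cumulative count alongside the values, and 
counts up to any hi become a constant-time lookup.

```python
class CountingList:
    """Append-only list of booleans with a running count of True values.

    cumulative[i] holds the number of True values in items[:i], so
    count_true(hi) needs no slicing or scanning at all.
    """

    def __init__(self):
        self.items = []
        self.cumulative = [0]  # cumulative[0] = 0 Trues in the empty prefix

    def append(self, value):
        self.items.append(value)
        # Extend the running total by 1 if the new value is True.
        self.cumulative.append(self.cumulative[-1] + (1 if value else 0))

    def count_true(self, hi):
        # Number of True values in items[:hi], in O(1) time.
        return self.cumulative[min(hi, len(self.items))]

cl = CountingList()
for v in [True, False, True, True, False]:
    cl.append(v)
print(cl.count_true(3))  # Trues in items[:3] -> 2
```

The cost is one extra integer per element and a little work on each 
append, paid back every time you would otherwise scan a long prefix.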



-- 
Steven