[issue29756] List count() counts True as 1

2017-03-08 Thread Barry A. Warsaw

Barry A. Warsaw added the comment:

bools are subclasses of int and False and True have integer equivalents:

https://docs.python.org/3/library/stdtypes.html#bltin-boolean-values
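
For reference, a quick interpreter session illustrating the documented
behaviour; the identity check at the end is one way to count only True
if that distinction ever matters:

>>> isinstance(True, int)
True
>>> [1, 2, 3, True].count(1)
2
>>> sum(1 for x in [1, 2, 3, True] if x is True)
1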

--
nosy: +barry
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue29756] List count() counts True as 1

2017-03-08 Thread Alexander Todorov

New submission from Alexander Todorov:

When using list.count() I get the following results

>>> [1, 2, 3].count(1)
1
>>> [1, 2, 3, True].count(2)
1
>>> [1, 2, 3, True].count(True)
2
>>> [1, 2, 3, True].count(1)
2

As you can see, True is considered the same as 1. The documentation for the
count method says:

count(...)
L.count(value) -> integer -- return number of occurrences of value

so IMO the above behavior is wrong. I am seeing this on a RHEL 7 system with
Python 3.5.1 and 2.7.5.

--
messages: 289235
nosy: Alexander Todorov
priority: normal
severity: normal
status: open
title: List count() counts True as 1
versions: Python 2.7, Python 3.5




Re: List Count

2013-04-24 Thread Blind Anagram
On 24/04/2013 02:59, Steven D'Aprano wrote:
 On Tue, 23 Apr 2013 17:57:17 +0100, Blind Anagram wrote:

[snip]

 In my opinion, it is more important to be efficient for large sieves, not 
 small. As they say, for small N, everything is fast. Nobody is going to 
 care about the difference between small-N taking 1ms or 10ms, but they 
 will care about the difference between big-N taking 1 minute or 1 hour. 
 So, as a general rule, optimize the expensive cases, and if the cheap 
 cases are a little less cheap than they otherwise would have been, nobody 
 will care or notice.

I agree in general but it is often the case that more sophisticated
algorithms only gain traction over simpler ones at much higher points
than might be expected from a simple analysis.  In my experience a naive
sieve is an efficient solution for a general purpose sieve class
primarily intended for situations where the sieve length can be large
but not huge.

As I think you have pointed out, memory is cheap and the memory
operations needed to do copying and counting operations are fast. So
using up memory is not normally an issue.  I have just found a situation
where a copy can have a serious impact on performance in an admittedly
limiting, minority use case.  It is the fact that this copy is not, in
principle, necessary, that I find disappointing.

[snip]
 In this case, I would say that adding a limit argument to list.count is 
 *absolutely not* worthwhile, because it is insufficiently general. To be 
 general enough to be worthwhile, you would have to add three arguments:
 
 list.count(value, start=0, end=None, step=1)
 
 But even there I don't think it's general enough to cover the costs. I 
 believe that a better solution would be to create list views to offer a 
 subset of the list without actually copying.

I don't know a great deal about Python internals but I suspect that
these solutions would offer more but also cost more.  So where the
balance of advantages would lie is unclear.  But I am not pushing for a
particular (or, indeed, any) solution.
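
For concreteness, a minimal pure-Python sketch of the list-view idea
under discussion (the class name and interface are illustrative, not an
actual proposal; a Python-level count loop is also slower than the
C-level list.count, which is part of the cost question raised above):

class ListView:
    """Read-only view of part of a list; no copy is made."""
    def __init__(self, data, start=0, stop=None):
        self._data = data
        self._start = start
        self._stop = len(data) if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self._data[self._start + i]

    def count(self, value):
        data = self._data
        return sum(1 for i in range(self._start, self._stop)
                   if data[i] == value)

With this, ListView(sieve, 0, hi).count(True) counts without copying.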



Re: List Count

2013-04-23 Thread Blind Anagram
On 23/04/2013 00:06, Oscar Benjamin wrote:
 On 22 April 2013 22:25, Blind Anagram blindanag...@nowhere.org wrote:
 On 22/04/2013 21:18, Oscar Benjamin wrote:
 On 22 April 2013 17:38, Blind Anagram blindanag...@nowhere.org wrote:

 I also wondered whether I had missed any obvious way of avoiding the
 slicing cost (intellectually it seemed wrong to me to have to copy the
 list in order to count items within it).
  [snip]

 I have looked at solutions based on listing primes and here I have found
 that they are very much slower than my existing solution when the sieve
 is not large (which is the majority use case).
 
 What matters is not so much the size of the sieve but the size of the
 interval you want to query. You say that slicing cost is somehow
 significant which suggests to me that it's not a small interval. An
 approach using a sorted list of primes and bisect would have a cost
 that is independent of the size of the interval (and depends only
 logarithmically on the size of the sieve).

The issue here is that the prime number count is only one of the
features of the class so I would have to essentially rewrite it to use
the technique you suggest.
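
For reference, the bisect approach Oscar describes might look like this
(a sketch assuming primes is a sorted list of all primes up to the
sieve limit):

import bisect

def count_primes_in(primes, lo, hi):
    """Number of primes p with lo <= p < hi; primes must be sorted."""
    return bisect.bisect_left(primes, hi) - bisect.bisect_left(primes, lo)

The cost per query is O(log len(primes)), independent of the width of
the interval, which is the point made above.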

And I found in my experiments that creating the sieve in the form of a
list of primes (either directly or by converting a linear sieve) is a
great deal slower than a simple sieve for the majority use cases where
the sieve is not huge.

I don't want to take up people's time but I am willing to expose the code
to anyone who has an interest as I am sure that it has wrinkles that
others with more experience with Python would find.

But I would also be interested to discover whether there is a rationale
for not offering list.count(value, limit) as this seems to me a useful
function which I have found myself wanting more than once now.

Thank you again for your input, which I appreciate.

   Brian



Re: List Count

2013-04-23 Thread Blind Anagram
On 23/04/2013 00:28, Steven D'Aprano wrote:
 On Mon, 22 Apr 2013 22:25:50 +0100, Blind Anagram wrote:
 
 I have looked at solutions based on listing primes and here I have found
 that they are very much slower than my existing solution when the sieve
 is not large (which is the majority use case).
 
 Yes. This is hardly surprising. Algorithms suitable for dealing with the 
 first million primes are not suitable for dealing with the first trillion 
 primes, and vice versa. We like to pretend that computer programming is 
 an abstraction, and for small enough data we often can get away with 
 that, but like all abstractions eventually it breaks and the cost of 
 dealing with real hardware becomes significant.
 
 But I must ask, given that the primes are so widely distributed, why are 
 you storing them in a list instead of a sparse array (i.e. a dict)? There 
 are 50,847,534 primes less than or equal to 1,000,000,000, so you are 
 storing roughly 18 False values for every True value. That ratio will 
 only get bigger. With a billion entries, you are using 18 times more 
 memory than necessary.

Because the majority use case for my Prime class is for a sieve that is
not large.  I am just pushing the envelope for a minority use case so
that it still works for huge sieves, albeit inefficiently.

I accept it is inefficient, but the fact remains that I can produce a
sieve that can yield and count a billion primes in a reasonable time but
this fails when I wish to count on a part of the sieve because this can
double the memory requirement for the lack of a list.count(value, limit)
function.

I would not dream of doing this job by copying a slice in any other
language that I have used so I was simply asking for advice to discover
whether this copy could be avoided whilst staying with the simple sieve
design.

Thank you for your input.

   Brian



Re: List Count

2013-04-23 Thread Blind Anagram
On 23/04/2013 02:47, Dave Angel wrote:
 On 04/22/2013 05:32 PM, Blind Anagram wrote:
 On 22/04/2013 22:03, Oscar Benjamin wrote:
 On 22 April 2013 21:18, Oscar Benjamin oscar.j.benja...@gmail.com
 wrote:
 On 22 April 2013 17:38, Blind Anagram blindanag...@nowhere.org wrote:
 On 22/04/2013 17:06, Oscar Benjamin wrote:

 I don't know what your application is but I would say that my first
 port of call here would be to consider a different algorithmic
 approach. An obvious question would be about the sparsity of this
 data
 structure. How frequent are the values that you are trying to count?
 Would it make more sense to store a list of their indices?

 Actually it is no more than a simple prime sieve implemented as a
 Python
 class (and, yes, I realize that there are plenty of these around).

 If I understand correctly, you have a list of roughly a billion
 True/False values indicating which integers are prime and which are
 not. You would like to discover how many prime numbers there are
 between two numbers a and b. You currently do this by counting the
 number of True values in your list between the indices a and b.

 If my description is correct then I would definitely consider using a
 different algorithmic approach. The density of primes from 1 to 1
 billion is about 5%. Storing the prime numbers themselves in a sorted
 list would save memory and allow a potentially more efficient way of
 counting the number of primes within some interval.

 In fact it is probably quicker if you don't mind using all that memory
 to just store the cumulative sum of your prime True/False indicator
 list. This would be the prime counting function pi(n). You can then
 count the primes between a and b in constant time with pi[b] - pi[a].
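
A sketch of the cumulative-sum idea quoted above (the toy sieve and the
names here are illustrative):

from itertools import accumulate

sieve = [False, False, True, True, False, True, False, True]  # toy indicator list
prefix = [0] + list(accumulate(int(x) for x in sieve))

def count_true(a, b):
    """Number of True entries in sieve[a:b], in O(1) per query."""
    return prefix[b] - prefix[a]

print(count_true(0, 8))  # 4: the primes 2, 3, 5 and 7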

 I did wonder whether, after creating the sieve, I should simply go
 through the list and replace the True values with a count.  This would
 certainly speed up the prime count function, which is where the issue
 arises.  I will try this and see what sort of performance trade-offs
 this involves.

 
 By doing that replacement, you'd increase memory usage manyfold (maybe
 3:1, I don't know).  As long as you're only using bools in the list, you
 only have the list overhead to consider, because all the objects
 involved are already cached (True and False exist only once each).  If
 you have integers, you'll need a new object for each nonzero count.

Thank you, Dave, you have answered a question that I was going to ask
before I even asked it!




Re: List Count

2013-04-23 Thread Blind Anagram
On 23/04/2013 00:01, Steven D'Aprano wrote:
 On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote:
 
 But when using a sub-sequence, I do suffer a significant reduction in
 speed for a count when compared with count on the full list.  When the
 list is small enough not to cause memory allocation issues this is about
 30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
 memory allocation becomes an issue and the cost on my system rises to
 over 600%.
 
 Buy more memory :-)
 
 
 I agree that this is not a big issue but it seems to me a high price to
 pay for the lack of a sieve.count(value, limit), which I feel is a
 useful function (given that memoryview operations are not available for
 lists).
 
 There is no need to complicate the count method for such a specialised 
 use-case. A more general solution would be to provide list views. 
 
 Another solution might be to use arrays rather than lists. Since your 
 sieve list is homogeneous, you could possibly use an array of 1 or 0 
 bytes rather than a list of True or False bools. That would reduce the 
 memory overhead by a factor of four, and similarly reduce the overhead of 
 any copying:

I did a lot of work comparing the overall performance of the sieve when
using either lists or arrays and I found that lists were a lot faster
for the majority use case when the sieve is not large.

   Brian



Re: List Count

2013-04-23 Thread Oscar Benjamin
On 23 April 2013 08:05, Blind Anagram blindanag...@nowhere.org wrote:
 On 23/04/2013 00:01, Steven D'Aprano wrote:
 On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote:

 But when using a sub-sequence, I do suffer a significant reduction in
 speed for a count when compared with count on the full list.  When the
 list is small enough not to cause memory allocation issues this is about
 30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
 memory allocation becomes an issue and the cost on my system rises to
 over 600%.
[snip]

 Another solution might be to use arrays rather than lists. Since your
 sieve list is homogeneous, you could possibly use an array of 1 or 0
 bytes rather than a list of True or False bools. That would reduce the
 memory overhead by a factor of four, and similarly reduce the overhead of
 any copying:

 I did a lot of work comparing the overall performance of the sieve when
 using either lists or arrays and I found that lists were a lot faster
 for the majority use case when the sieve is not large.

Okay, now I understand. I thought you were trying to optimise for
large lists, but in fact you have already optimised for small lists.
As a result you have made algorithmic choices that don't scale very
well. Since you're still concerned about performance on small lists
you don't want to rethink those choices. Instead you want a
micro-optimisation that would compensate for them.

Elsewhere you said:

 I would not dream of doing this job by copying a slice in any other
 language that I have used so I was simply asking for advice to discover
 whether this copy could be avoided whilst staying with the simple sieve
 design.

So you already knew that there would be problems with this method, but
you've chosen it anyway since it turned out to be fastest for small
lists. You could always just do a different thing when the list is
large:

def pi(self, n):
    if n < 100:
        return self.indicator[:n].sum()
    else:
        return sum(itertools.islice(self.indicator, n))

However, if you really want to improve performance in computing pi(n)
for large n you should just use one of the existing algorithms having
sublinear space/time complexity. These also evaluate pi(n) with
sieves but the sieve only needs to be as big as sqrt(n) rather than n
for the obvious method:
http://en.wikipedia.org/wiki/Prime-counting_function#Algorithms_for_evaluating_.CF.80.28x.29


Oscar


Re: List Count

2013-04-23 Thread Blind Anagram
On 23/04/2013 12:08, Oscar Benjamin wrote:
 On 23 April 2013 08:05, Blind Anagram blindanag...@nowhere.org wrote:
 On 23/04/2013 00:01, Steven D'Aprano wrote:
 On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote:

 But when using a sub-sequence, I do suffer a significant reduction in
 speed for a count when compared with count on the full list.  When the
 list is small enough not to cause memory allocation issues this is about
 30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
 memory allocation becomes an issue and the cost on my system rises to
 over 600%.
 [snip]

 Another solution might be to use arrays rather than lists. Since your
 sieve list is homogeneous, you could possibly use an array of 1 or 0
 bytes rather than a list of True or False bools. That would reduce the
 memory overhead by a factor of four, and similarly reduce the overhead of
 any copying:

 I did a lot of work comparing the overall performance of the sieve when
 using either lists or arrays and I found that lists were a lot faster
 for the majority use case when the sieve is not large.
 
 Okay, now I understand. I thought you were trying to optimise for
 large lists, but in fact you have already optimised for small lists.
 As a result you have made algorithmic choices that don't scale very
 well. Since you're still concerned about performance on small lists
 you don't want to rethink those choices. Instead you want a
 micro-optimisation that would compensate for them.
 
 Elsewhere you said:
 
 I would not dream of doing this job by copying a slice in any other
 language that I have used so I was simply asking for advice to discover
 whether this copy could be avoided whilst staying with the simple sieve
 design.
 
 So you already knew that there would be problems with this method, but
 you've chosen it anyway since it turned out to be fastest for small
 lists. You could always just do a different thing when the list is
 large:

Your analysis of my rationale is sound except that I only found out that
I had a problem with counting in a subset of a list when I actually
tried this for a huge sieve.

It was only then that I discovered that there was no way of setting the
upper limit in list.count(x) and hence that I would have to create a
slice or find an alternative approach.

I then wondered why count for lists has no limits whereas count for
other objects (e.g. strings) has these.  I also wondered whether there
was an easy way of avoiding the slice, not that this is critical, but
rather because it is just nice to have a sieve that still actually works
for huge values, albeit inefficiently.

 def pi(self, n):
     if n < 100:
         return self.indicator[:n].sum()
     else:
         return sum(itertools.islice(self.indicator, n))


I have looked at itertools.islice as a complete replacement for count()
and found that, on average, it was a lot slower. But I have not tried hybrid
strategies as you suggest - I'll take a look at this.

My thanks to you and others for the advice that has been offered.

   Brian



Re: List Count

2013-04-23 Thread Steven D'Aprano
On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote:

 I did a lot of work comparing the overall performance of the sieve when
 using either lists or arrays and I found that lists were a lot faster
 for the majority use case when the sieve is not large.

And when the sieve is large?

I don't actually believe that the bottleneck is the cost of taking a list 
slice. That's pretty fast, even for huge lists, and all efforts to skip 
making a copy by using itertools.islice actually ended up slower. But 
suppose it is the bottleneck. Then *sooner or later* arrays will win over 
lists, simply because they're smaller.

Of course, sooner or later might be much later.

I expect that you will not find a single algorithm, or data structure, 
that works optimally for both small and huge inputs. In general, there 
are two strategies you might take:


1) Use an algorithm or data structure which is efficient for small inputs 
with small inputs, and after some cut-off size, swap to a different 
algorithm which is efficient for large inputs. That swap over may require 
a one-off conversion cost, but provided your sieve never shrinks, this 
may not matter.


2) Use only the algorithm for large inputs. For small inputs, the 
difference between the two is insignificant in absolute terms (who cares 
if the operation takes 5ms instead of 1ms?), but for large N, there is a 
clear winner.

There's nothing that says you're limited to two algorithms. You may find 
that to really optimize things, you need three or more algorithms, each 
one optimized for a particular subset of inputs. Of course, all this 
added complexity is itself very costly. Is it worth it?



-- 
Steven


Re: List Count

2013-04-23 Thread Blind Anagram
On 23/04/2013 15:49, Steven D'Aprano wrote:
 On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote:
 
 I did a lot of work comparing the overall performance of the sieve when
 using either lists or arrays and I found that lists were a lot faster
 for the majority use case when the sieve is not large.
 
 And when the sieve is large?

I don't know but since the majority use case is when the sieve is small,
it makes sense to choose a list.

 I don't actually believe that the bottleneck is the cost of taking a list 
 slice. That's pretty fast, even for huge lists, and all efforts to skip 
 making a copy by using itertools.islice actually ended up slower. But 
 suppose it is the bottleneck. Then *sooner or later* arrays will win over 
 lists, simply because they're smaller.

Maybe you have not noticed that, when I am discussing a huge sieve, I am
simply pushing a sieve designed primarily for a small sieve lengths to
the absolute limit.  This is most definitely a minority use case.

In pushing the size of the sieve upwards, it is the slice operation that
is the first thing that 'breaks'.  This is because the slice can be
almost as big as the primary array so the OS gets driven into memory
allocation problems for a sieve that is about half the length it would
otherwise be. It still works but the cost of the slice once this point
is reached rises from about 20% to over 600% because of all the paging
going on.

The unavailable list.count(value, limit) function would hence allow the
sieve length to be up to twice as large before running into problems and
would also cut the 20% slice cost I am seeing on smaller sieve lengths.

So, all I was doing in asking for advice was to check whether there is
an easy way of avoiding the slice copy, not because this is critical,
but rather because it is a pity to limit the performance because Python
forces a (strictly unnecessary) copy in order to perform a count within
a part of a list.

In other words, the lack of a list.count(value, limit) function makes
Python less effective than it would otherwise be.  I haven't looked at
Python's C code base but I still wonder if there is a good reason for NOT
providing this?



Re: List Count

2013-04-23 Thread Oscar Benjamin
On 23 April 2013 17:57, Blind Anagram blindanag...@nowhere.org wrote:
 On 23/04/2013 15:49, Steven D'Aprano wrote:
 On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote:

 I did a lot of work comparing the overall performance of the sieve when
 using either lists or arrays and I found that lists were a lot faster
 for the majority use case when the sieve is not large.

 And when the sieve is large?

 I don't know but since the majority use case is when the sieve is small,
 it makes sense to choose a list.

That's an odd comment given what you said at the start of this thread:

Blind Anagram wrote:
 I would be grateful for any advice people can offer on the fastest way
 to count items in a sub-sequence of a large list.

 I have a list of boolean values that can contain many hundreds of
 millions of elements for which I want to count the number of True values
 in a sub-sequence, one from the start up to some value (say hi).


 I don't actually believe that the bottleneck is the cost of taking a list
 slice. That's pretty fast, even for huge lists, and all efforts to skip
 making a copy by using itertools.islice actually ended up slower. But
 suppose it is the bottleneck. Then *sooner or later* arrays will win over
 lists, simply because they're smaller.

 Maybe you have not noticed that, when I am discussing a huge sieve, I am
 simply pushing a sieve designed primarily for a small sieve lengths to
 the absolute limit.  This is most definitely a minority use case.

 In pushing the size of the sieve upwards, it is the slice operation that
 is the first thing that 'breaks'.  This is because the slice can be
 almost as big as the primary array so the OS gets driven into memory
 allocation problems for a sieve that is about half the length it would
 otherwise be. It still works but the cost of the slice once this point
 is reached rises from about 20% to over 600% because of all the paging
 going on.

You keep mentioning that you want it to work with a large sieve. I
would much rather compute the same quantities with a small sieve if
possible. If you were using the Lehmer/Meissel algorithm you would be
able to compute the same quantity (i.e. pi(1e9)) using a much smaller
sieve with 30k items instead of 1e9. That would fit *very* comfortably
in memory and you wouldn't even need to slice the list. Or to put it
another way, you could compute pi(~1e18) using your current sieve
without slicing or paging. If you want to lift the limit on computing
pi(x) this is clearly the way to go.


 The unavailable list.count(value, limit) function would hence allow the
 sieve length to be up to twice as large before running into problems and
 would also cut the 20% slice cost I am seeing on smaller sieve lengths.

 So, all I was doing in asking for advice was to check whether there is
 an easy way of avoiding the slice copy, not because this is critical,
 but rather because it is a pity to limit the performance because Python
 forces a (strictly unnecessary) copy in order to perform a count within
 a part of a list.

 In other words, the lack of a list.count(value, limit) function makes
 Python less effective than it would otherwise be.  I haven't looked at
 Python's C code base but I still wonder if there is a good reason for NOT
 providing this?

If you feel that this is a good suggestion for an improvement to
Python consider posting it on python-ideas. I wasn't aware of the
equivalent functionality on strings but I see that the tuple.count()
function is the same as list.count().


Oscar


Re: List Count

2013-04-23 Thread Blind Anagram
On 23/04/2013 18:45, Oscar Benjamin wrote:

 I did a lot of work comparing the overall performance of the sieve when
 using either lists or arrays and I found that lists were a lot faster
 for the majority use case when the sieve is not large.

 And when the sieve is large?

 I don't know but since the majority use case is when the sieve is small,
 it makes sense to choose a list.
 
 That's an odd comment given what you said at the start of this thread:
 
 Blind Anagram wrote:
 I would be grateful for any advice people can offer on the fastest way
 to count items in a sub-sequence of a large list.

 I have a list of boolean values that can contain many hundreds of
 millions of elements for which I want to count the number of True values
 in a sub-sequence, one from the start up to some value (say hi).

At this early stage in the discussion I was simply explaining the
immediate context of the problem on which I was seeking advice.

Here I didn't think it was necessary to expand on the wider context
since there might have been a very easy way that people could suggest
for avoiding the slice copy when counting on a part of a list.

But there isn't, so the wider details then became important in
explaining why some proposals might work for the limiting case but would
not make sense within the overall context of use.  And here I have said
on more than one occasion that the huge sieve case is a minority use case.

[snip]
 You keep mentioning that you want it to work with a large sieve. I
 would much rather compute the same quantities with a small sieve if
 possible. If you were using the Lehmer/Meissel algorithm you would be
 able to compute the same quantity (i.e. pi(1e9)) using a much smaller
 sieve with 30k items instead of 1e9. That would fit *very* comfortably
 in memory and you wouldn't even need to slice the list. Or to put it
 another way, you could compute pi(~1e18) using your current sieve
 without slicing or paging. If you want to lift the limit on computing
 pi(x) this is clearly the way to go.

If prime_pi for huge numbers was really important to me I wouldn't be
using Python!

This is just one of a number of functions implemented in the class.  It
is nice to have and it is also nice for testing purposes to be able to
run it for large sieves. But it is by no means important enough to
devote dedicated code to computing it. The prime generator within the
class is far more important and is the workhorse for most uses.

[snip]
 In other words, the lack of a list.count(value, limit) function makes
 Python less effective than it would otherwise be.  I haven't looked at
  Python's C code base but I still wonder if there is a good reason for NOT
 providing this?
 
 If you feel that this is a good suggestion for an improvement to
 Python consider posting it on python-ideas. I wasn't aware of the
 equivalent functionality on strings but I see that the tuple.count()
 function is the same as list.count().

To be honest, I am not really in a position to judge whether this is a
'good' suggestion.  It has turned up as potentially useful in two cases
for me but one of these (the one discussed here) is a minority use case.
 I would also be a bit worried about launching this into a group of
Python experts who will almost certainly be even more demanding of
explanations than you folk here!

   Brian



Re: List Count

2013-04-23 Thread Terry Jan Reedy

On 4/23/2013 7:45 AM, Blind Anagram wrote:


I then wondered why count for lists has no limits


Probably because no one has asked for such, at least partly because it 
is not really needed. In any case, .count(s) is a generic method. It is 
part of the definition of a Sequence. It can also be implemented for 
non-sequence collections, such as a Tree class, that allow multiple 
occurrences of an item.
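
As a concrete illustration of this point, collections.abc.Sequence
supplies count() (and index()) as mixin methods, so any conforming
sequence gets them for free; a minimal sketch:

from collections.abc import Sequence

class Repeat(Sequence):
    """Toy sequence: one value repeated n times."""
    def __init__(self, value, n):
        self.value, self.n = value, n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.value

print(Repeat('x', 5).count('x'))  # 5, via the inherited mixin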


 whereas count for other objects (e.g. strings) has these.

Strings (of unicode or bytes) are exceptional in multiple ways.

--
Terry Jan Reedy




Re: List Count

2013-04-23 Thread Oscar Benjamin
On 23 April 2013 19:30, Blind Anagram blindanag...@nowhere.org wrote:
 On 23/04/2013 18:45, Oscar Benjamin wrote:

 [snip]
 You keep mentioning that you want it to work with a large sieve. I
 would much rather compute the same quantities with a small sieve if
 possible. If you were using the Lehmer/Meissel algorithm you would be
 able to compute the same quantity (i.e. pi(1e9)) using a much smaller
 sieve with 30k items instead of 1e9. That would fit *very* comfortably
 in memory and you wouldn't even need to slice the list. Or to put it
 another way, you could compute pi(~1e18) using your current sieve
 without slicing or paging. If you want to lift the limit on computing
 pi(x) this is clearly the way to go.

 If prime_pi for huge numbers was really important to me I wouldn't be
 using Python!

I would, at least to begin with. The advantage that Python has for
this sort of thing is that it takes a minimal amount of programmer
effort to implement these kind of algorithms. This is because you
don't have to worry about nuts and bolts problems like integer
overflow and memory allocation. You also have an abundance of
different data structures to fulfil the big-O requirements of most
algorithms. It's easy to make generators that can iterate over
infinite or arbitrarily large sequences while still using constant
memory. And so on...

To implement the algorithm I mentioned in e.g. C would be considerably
more work. C would, however, perform much better using your brute
force approach and would lift the memory constraints of your program
by a small (in asymptotic terms) amount.

So I actually find that I can often get a faster program in Python
simply because it's much easier to implement fancier algorithms with
the optimal asymptotic performance that will ultimately outperform a
brute force approach whichever language is used.

Although, in saying that, those particular algorithms are probably
most naturally implemented in a functional language like Haskell with
scalable recursion.

 In other words, the lack of a list.count(value, limit) function makes
 Python less effective than it would otherwise be.  I haven't looked at
 Python's C code base but I still wonder if there is a good reason for NOT
 providing this?

 If you feel that this is a good suggestion for an improvement to
 Python consider posting it on python-ideas. I wasn't aware of the
 equivalent functionality on strings but I see that the tuple.count()
 function is the same as list.count().

 To be honest, I am not really in a position to judge whether this is a
 'good' suggestion.  It has turned up as potentially useful in two cases
 for me but one of these (the one discussed here) is a minority use case.
  I would also be a bit worried about launching this into a group of
 Python experts who will almost certainly be even more demanding of
 explanations than you folk here!

I wouldn't worry about that. I've never felt the need for this but
then I would probably use numpy to do what you're doing. It's
certainly not an outlandish suggestion and I'd be surprised if no one
agreed with the idea.


Oscar


Re: List Count

2013-04-23 Thread Terry Jan Reedy

On 4/23/2013 12:57 PM, Blind Anagram wrote:


So, all I was doing in asking for advice was to check whether there is
an easy way of avoiding the slice copy,


And there is.


not because this is critical,
but rather because it is a pity to limit the performance because Python
forces a (strictly unnecessary) copy in order to perform a count within
a part of a list.


Python does not force that. You have been given several simple no-copy 
alternatives. They happen to be slower *with CPython* because of the 
speed difference between Python code and C code. If the same timing 
tests were done with any of the implementations that execute python code 
faster, the results would likely be different.


I think str/byte/bytearray.count have more need for optional start,stop 
boundary parameters because a) people search in long texts and subtexts, 
more so I think than for other sequences, b) they search for substrings 
longer than 1 and hence c) the generic no-slice alternatives do not work 
for counting substrings.


That said, I do see that tuple/list.index have had start, stop 
parameters added, so doing the same for .count is conceivable. I just do 
not remember anyone else asking for such. The use case must be very 
rare. And as I said in my other post, .count(x) applies to any 
collections, but start,stop would only apply to sequences.



In other words, the lack of a list.count(value, limit) function makes
Python less effective than it would otherwise be.


Untrue. The alternatives are just as *effective*.




Re: List Count

2013-04-23 Thread Oscar Benjamin
On 23 April 2013 21:00, Terry Jan Reedy tjre...@udel.edu wrote:

 That said, I do see that tuple/list.index have had start, stop parameters
 added, so doing the same for .count is conceivable.

Are those new? I don't remember them not being there.

You need the start/stop parameters to be able to use index for finding
all occurrences of an item rather than just the first:

def indices(value, sequence):
    i = -1
    while True:
        try:
            i = sequence.index(value, i + 1)
        except ValueError:
            return
        yield i
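
A quick check of the generator:

>>> list(indices(1, [1, 0, 1, 1, 0, 1]))
[0, 2, 3, 5]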

I thought this was a common use case for .index() and that if the
interface had been designed more recently then it might have just been
an .indices() method that returned an iterator.


Oscar


Re: List Count

2013-04-23 Thread Blind Anagram
On 23/04/2013 21:00, Terry Jan Reedy wrote:
 On 4/23/2013 12:57 PM, Blind Anagram wrote:
 
 So, all I was doing in asking for advice was to check whether there is
 an easy way of avoiding the slice copy,
 
 And there is.
 
 not because this is critical,
 but rather because it is a pity to limit the performance because Python
 forces a (strictly unnecessary) copy in order to perform a count within
 a part of a list.

 Python does not force that. You have been given several simple no-copy
 alternatives. They happen to be slower *with CPython* because of the
 speed difference between Python code and C code. If the same timing
 tests were done with any of the implementations that execute python code
 faster, the results would likely be different.

Then being pedantic rather than colloquial, the lack of start, end
parameters in the function list.count(value) means that anyone wishing
to use this function on a part of a list is forced to slice the list and
thereby invoke a possibly costly copy operation, one that is, in
principle, not necessary in order to perform the underlying operation.

[snip]
 In other words, the lack of a list.count(value, limit) function makes
 Python less effective than it would otherwise be.
 
 Untrue. The alternatives are just as *effective*.

Then I fear that we will have to accept that we disagree on this.

   Brian



Re: List Count

2013-04-23 Thread Steven D'Aprano
On Tue, 23 Apr 2013 17:57:17 +0100, Blind Anagram wrote:

 On 23/04/2013 15:49, Steven D'Aprano wrote:
 On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote:
 
 I did a lot of work comparing the overall performance of the sieve
 when using either lists or arrays and I found that lists were a lot
 faster for the majority use case when the sieve is not large.
 
 And when the sieve is large?
 
 I don't know but since the majority use case is when the sieve is small,
 it makes sense to choose a list.
 
 I don't actually believe that the bottleneck is the cost of taking a
 list slice. That's pretty fast, even for huge lists, and all efforts to
 skip making a copy by using itertools.islice actually ended up slower.
 But suppose it is the bottleneck. Then *sooner or later* arrays will
 win over lists, simply because they're smaller.
 
 Maybe you have not noticed that, when I am discussing a huge sieve, I am
 simply pushing a sieve designed primarily for a small sieve lengths to
 the absolute limit.  This is most definitely a minority use case.

You don't say? I hadn't noticed the first three hundred times you 
mentioned it :-P


In my opinion, it is more important to be efficient for large sieves, not 
small. As they say, for small N, everything is fast. Nobody is going to 
care about the difference between small-N taking 1ms or 10ms, but they 
will care about the difference between big-N taking 1 minute or 1 hour. 
So, as a general rule, optimize the expensive cases, and if the cheap 
cases are a little less cheap than they otherwise would have been, nobody 
will care or notice.

Of course, it's your code, and you're free to make whatever trade-offs 
between time and space and programmer effort that you feel are best. I'm 
just making a suggestion.


[...]
 In other words, the lack of a list.count(value, limit) function makes
 Python less effective than it would otherwise be.  I haven't looked at
  Python's C code base but I still wonder if there is a good reason for NOT
 providing this?

The same reasons for not providing any other of an infinite number of 
potential features. You have to weigh up the potential benefits of the 
feature against the costs:

- Is this a good use of developer time to build the feature?

- Every new feature requires somebody to write it, somebody to check it, 
somebody to write tests for it, somebody to document it. Who is going to 
do all this work?

- Every new feature increases the size and complexity of the language and 
makes it harder for people to learn.

- Does this new feature actually have the advantages claimed? Many so-
called optimizations actually turn out to be pessimizations instead.

- Will the new feature be an attractive nuisance, fooling programmers 
into using it inappropriately?


In this case, I would say that adding a limit argument to list.count is 
*absolutely not* worthwhile, because it is insufficiently general. To be 
general enough to be worthwhile, you would have to add three arguments:

list.count(value, start=0, end=None, step=1)

But even there I don't think it's general enough to cover the costs. I 
believe that a better solution would be to create list views to offer a 
subset of the list without actually copying.


-- 
Steven


List Count

2013-04-22 Thread Blind Anagram
I would be grateful for any advice people can offer on the fastest way
to count items in a sub-sequence of a large list.

I have a list of boolean values that can contain many hundreds of
millions of elements for which I want to count the number of True values
in a sub-sequence, one from the start up to some value (say hi).

I am currently using:

   sieve[:hi].count(True)

but I believe this may be costly because it copies a possibly large part
of the sieve.

Ideally I would like to be able to use:

   sieve.count(True, hi)

where 'hi' sets the end of the count but this function is, sadly, not
available for lists.

The use of a bytearray with a memoryview object instead of a list solves
this particular problem but it is not a solution for me as it creates
more problems than it solves in other aspects of the program.

Can I assume that one possible solution would be to sub-class list and
create a C based extension to provide list.count(value, limit)?

Are there any other solutions that will avoid copying a large part of
the list?


Re: List Count

2013-04-22 Thread Dave Angel

On 04/22/2013 07:58 AM, Blind Anagram wrote:

I would be grateful for any advice people can offer on the fastest way
to count items in a sub-sequence of a large list.

I have a list of boolean values that can contain many hundreds of
millions of elements for which I want to count the number of True values
in a sub-sequence, one from the start up to some value (say hi).

I am currently using:

sieve[:hi].count(True)

but I believe this may be costly because it copies a possibly large part
of the sieve.

Ideally I would like to be able to use:

sieve.count(True, hi)

where 'hi' sets the end of the count but this function is, sadly, not
available for lists.

The use of a bytearray with a memoryview object instead of a list solves
this particular problem but it is not a solution for me as it creates
more problems than it solves in other aspects of the program.

Can I assume that one possible solution would be to sub-class list and
create a C based extension to provide list.count(value, limit)?

Are there any other solutions that will avoid copying a large part of
the list?



Instead of using the default slice notation, why not use
itertools.islice()?


Something like  (untested):

import itertools

it = itertools.islice(sieve, 0, hi)
sum(itertools.imap(bool, it))

I only broke it into two lines for clarity.  It could also be:

sum(itertools.imap(bool, itertools.islice(sieve, 0, hi)))

If you're using Python 3.x, say so, and I'm sure somebody can simplify 
these, since in Python 3, many functions already produce iterators 
instead of lists.
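
For the record, a Python 3 sketch of the same idea (imap is gone
because map is already lazy, and bools sum directly; the toy sieve here
is illustrative):

from itertools import islice

sieve = [True, False, True, True]   # stand-in for the real sieve
hi = 3

print(sum(map(bool, islice(sieve, hi))))  # 2
# or simply, since True == 1 and False == 0:
print(sum(islice(sieve, hi)))             # 2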



--
DaveA


Re: List Count

2013-04-22 Thread Blind Anagram
On 22/04/2013 13:51, Dave Angel wrote:
 On 04/22/2013 07:58 AM, Blind Anagram wrote:
 I would be grateful for any advice people can offer on the fastest way
 to count items in a sub-sequence of a large list.

 I have a list of boolean values that can contain many hundreds of
 millions of elements for which I want to count the number of True values
 in a sub-sequence, one from the start up to some value (say hi).

 I am currently using:

 sieve[:hi].count(True)

 but I believe this may be costly because it copies a possibly large part
 of the sieve.

 Ideally I would like to be able to use:

 sieve.count(True, hi)

 where 'hi' sets the end of the count but this function is, sadly, not
 available for lists.

 The use of a bytearray with a memoryview object instead of a list solves
 this particular problem but it is not a solution for me as it creates
 more problems than it solves in other aspects of the program.

 Can I assume that one possible solution would be to sub-class list and
 create a C based extension to provide list.count(value, limit)?

 Are there any other solutions that will avoid copying a large part of
 the list?

 
 Instead of using the default slice notation, why not use
 itertools.islice() ?
 
 Something like  (untested):
 
 import itertools
 
 it = itertools.islice(sieve, 0, hi)
 sum(itertools.imap(bool, it))
 
 I only broke it into two lines for clarity.  It could also be:
 
 sum(itertools.imap(bool, itertools.islice(sieve, 0, hi)))
 
 If you're using Python 3.x, say so, and I'm sure somebody can simplify
 these, since in Python 3, many functions already produce iterators
 instead of lists.

Thanks, I'll look at these ideas.  And, yes, my interest is mainly in
Python 3.



Re: List Count

2013-04-22 Thread Steven D'Aprano
On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:

 I would be grateful for any advice people can offer on the fastest way
 to count items in a sub-sequence of a large list.
 
 I have a list of boolean values that can contain many hundreds of
 millions of elements for which I want to count the number of True values
 in a sub-sequence, one from the start up to some value (say hi).
 
 I am currently using:
 
sieve[:hi].count(True)
 
 but I believe this may be costly because it copies a possibly large part
 of the sieve.

Have you timed it? Because Python is a high-level language, it is rarely 
obvious what code will be fast. Yes, sieve[:hi] will copy the first hi 
entries, but that's likely to be fast, basically just a memcopy, unless 
sieve is huge and memory is short. In other words, unless your sieve is 
so huge that the operating system cannot find enough memory for it, 
making a copy is likely to be relatively insignificant.

I've just tried seven different techniques to optimize this, and the 
simplest, most obvious technique is by far the fastest. Here are the 
seven different code snippets I measured, with results:


sieve[:hi].count(True)
sum(sieve[:hi])
sum(islice(sieve, hi))
sum(x for x in islice(sieve, hi) if x)
sum(x for x in islice(sieve, hi) if x is True)
sum(1 for x in islice(sieve, hi) if x is True)
len(list(filter(None, islice(sieve, hi))))


Here's the code I used to time them. Just copy and paste into an 
interactive interpreter:

=== cut ===

import random
sieve = [random.random() < 0.5 for i in range(10**7)]

from timeit import Timer
setup = """from __main__ import sieve
from itertools import islice
hi = 7*10**6
"""

t1 = Timer("sieve[:hi].count(True)", setup)
t2 = Timer("sum(sieve[:hi])", setup)
t3 = Timer("sum(islice(sieve, hi))", setup)
t4 = Timer("sum(x for x in islice(sieve, hi) if x)", setup)
t5 = Timer("sum(x for x in islice(sieve, hi) if x is True)", setup)
t6 = Timer("sum(1 for x in islice(sieve, hi) if x is True)", setup)
t7 = Timer("len(list(filter(None, islice(sieve, hi))))", setup)

for t in (t1, t2, t3, t4, t5, t6, t7):
    print(min(t.repeat(number=10)))

=== cut ===


On my computer, using Python 3.3, here are the timing results I get:

2.3714727330952883
7.96061935601756
7.230580328032374
10.080201900098473
11.544118068180978
9.216834562830627
3.499635103158653


Times shown are in seconds, and are for the best of three trials, each 
trial having 10 repetitions of the code being tested.

As you can see, clever tricks using sum are horrible pessimisations, the 
only thing that comes close to the obvious solution is the one using 
filter.

Although I have only tested a list with ten million items, not hundreds 
of millions, I don't expect that the results will be significantly 
different if you use a larger list, unless you are very short of memory.

[...]
 Can I assume that one possible solution would be to sub-class list and
 create a C based extension to provide list.count(value, limit)?

Of course. But don't optimize this until you know that you *need* to 
optimize it. Is it really a bottleneck in your code? There's no point in 
saving the 0.1 second it takes to copy the list if it takes 2 seconds to 
count the items regardless.


 Are there any other solutions that will avoid copying a large part of
 the list?

Yes, but they're slower.

Perhaps a better solution might be to avoid counting anything. If you can 
keep a counter, and each time you add a value to the list you update the 
counter, then getting the number of True values will be instantaneous.
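
A minimal sketch of the keep-a-counter idea (illustrative only; the
sieve in this thread is built by slice assignment, which a real version
would also need to handle):

class CountingList(list):
    """List that maintains a running count of its True entries."""
    def __init__(self, iterable=()):
        super().__init__(iterable)
        self.true_count = sum(1 for x in self if x is True)

    def append(self, value):
        super().append(value)
        if value is True:
            self.true_count += 1

With this, cl = CountingList([True, False]); cl.append(True) leaves
cl.true_count == 2 with no counting pass at query time.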



-- 
Steven


Re: List Count

2013-04-22 Thread Peter Otten
Blind Anagram wrote:

 I would be grateful for any advice people can offer on the fastest way
 to count items in a sub-sequence of a large list.
 
 I have a list of boolean values that can contain many hundreds of
 millions of elements for which I want to count the number of True values
 in a sub-sequence, one from the start up to some value (say hi).
 
 I am currently using:
 
sieve[:hi].count(True)
 
 but I believe this may be costly because it copies a possibly large part
 of the sieve.
 
 Ideally I would like to be able to use:
 
sieve.count(True, hi)
 
 where 'hi' sets the end of the count but this function is, sadly, not
 available for lists.
 
 The use of a bytearray with a memoryview object instead of a list solves
 this particular problem but it is not a solution for me as it creates
 more problems than it solves in other aspects of the program.
 
 Can I assume that one possible solution would be to sub-class list and
 create a C based extension to provide list.count(value, limit)?
 
 Are there any other solutions that will avoid copying a large part of
 the list?

If the list doesn't change often you can convert it to a string

>>> items = [True, False, False] * 10
>>> sitems = "".join("FT"[i] for i in items)
>>> sitems
'TFFTFFTFFTFFTFFTFFTFFTFFTFFTFF'
>>> sitems.count("T", 3, 10)
3
>>> sitems.count("F", 3, 10)
4

Or you use a[3:10].sum() on a boolean numpy array. Its slices are views 
rather than copies:

>>> import numpy
>>> a = numpy.array([True, False, False]*10)
>>> a[3:10].sum()
3




Re: List Count

2013-04-22 Thread 88888 Dihedral
Blind Anagram wrote on Monday, 22 April 2013 at 19:58:20 UTC+8:
 I would be grateful for any advice people can offer on the fastest way
 to count items in a sub-sequence of a large list.

 I have a list of boolean values that can contain many hundreds of
 millions of elements for which I want to count the number of True values
 in a sub-sequence, one from the start up to some value (say hi).

 I am currently using:

    sieve[:hi].count(True)

 but I believe this may be costly because it copies a possibly large part
 of the sieve.

 Ideally I would like to be able to use:

    sieve.count(True, hi)

 where 'hi' sets the end of the count but this function is, sadly, not
 available for lists.

 The use of a bytearray with a memoryview object instead of a list solves
 this particular problem but it is not a solution for me as it creates
 more problems than it solves in other aspects of the program.

 Can I assume that one possible solution would be to sub-class list and
 create a C based extension to provide list.count(value, limit)?

 Are there any other solutions that will avoid copying a large part of
 the list?

For problems involving a homogeneous list of numbers, please check
whether numpy arrays can fit your needs in practice.

Sometimes I work on numbers in widely varying ranges, and then Python's
lists and long integers are really handy.







Re: List Count

2013-04-22 Thread Skip Montanaro
Numpy is a big improvement here.  In Py 2.7 I get this output if I run
Steven's benchmark:

2.10364603996
3.68471002579
4.01849389076
7.41974878311
10.4202470779
9.16782712936
3.36137390137

(confirming his results).  If I then run the numpy idiom for this:


import random
from timeit import Timer

import numpy

sieve = numpy.array([random.random() < 0.5 for i in range(10**7)],
                    dtype=bool)

setup = """from __main__ import sieve
from itertools import islice
hi = 7*10**6
"""

t1 = Timer("(True == sieve[:hi]).sum()", setup)

print(min(t1.repeat(number=10)))
###

I get :

0.344316959381

It likely consumes less space as well, since it doesn't store Python
objects in the array.

Skip


Re: List Count

2013-04-22 Thread Blind Anagram
On 22/04/2013 14:13, Steven D'Aprano wrote:
 On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:
 
 I would be grateful for any advice people can offer on the fastest way
 to count items in a sub-sequence of a large list.

 I have a list of boolean values that can contain many hundreds of
 millions of elements for which I want to count the number of True values
 in a sub-sequence, one from the start up to some value (say hi).

 I am currently using:

sieve[:hi].count(True)

 but I believe this may be costly because it copies a possibly large part
 of the sieve.
 
 Have you timed it? Because Python is a high-level language, it is rarely 
 obvious what code will be fast. Yes, sieve[:hi] will copy the first hi 
 entries, but that's likely to be fast, basically just a memcopy, unless 
 sieve is huge and memory is short. In other words, unless your sieve is 
 so huge that the operating system cannot find enough memory for it, 
 making a copy is likely to be relatively insignificant.
 
 I've just tried seven different techniques to optimize this, and the 
 simplest, most obvious technique is by far the fastest. Here are the 
 seven different code snippets I measured, with results:
 
 
 sieve[:hi].count(True)
 sum(sieve[:hi])
 sum(islice(sieve, hi))
 sum(x for x in islice(sieve, hi) if x)
 sum(x for x in islice(sieve, hi) if x is True)
 sum(1 for x in islice(sieve, hi) if x is True)
 len(list(filter(None, islice(sieve, hi

Yes, I did time it and I agree with your results (where my tests overlap
with yours).

But when using a sub-sequence, I do suffer a significant reduction in
speed for a count when compared with count on the full list.  When the
list is small enough not to cause memory allocation issues this is about
30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
memory allocation becomes an issue and the cost on my system rises to
over 600%.

I agree that this is not a big issue but it seems to me a high price to
pay for the lack of a sieve.count(value, limit), which I feel is a
useful function (given that memoryview operations are not available for
lists).

 Of course. But don't optimize this until you know that you *need* to 
 optimize it. Is it really a bottleneck in your code? There's no point in 
 saving the 0.1 second it takes to copy the list if it takes 2 seconds to 
 count the items regardless.
 
 Are there any other solutions that will avoid copying a large part of
 the list?
 
 Yes, but they're slower.
 
 Perhaps a better solution might be to avoid counting anything. If you can 
 keep a counter, and each time you add a value to the list you update the 
 counter, then getting the number of True values will be instantaneous.

Creating the sieve is currently very fast as it is not done by adding
single items but by adding a large number of items at the same time
using a slice operation.  I could count the items in each slice as it is
added but this would add complexity that I would prefer to avoid because
the creation of the sieve is quite tricky to get right and I would
prefer not to fiddle with this.

Thank you (and others) for advice on this.



Re: List Count

2013-04-22 Thread Oscar Benjamin
On 22 April 2013 15:15, Blind Anagram blindanag...@nowhere.org wrote:
 On 22/04/2013 14:13, Steven D'Aprano wrote:
 On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:

 I would be grateful for any advice people can offer on the fastest way
 to count items in a sub-sequence of a large list.

 I have a list of boolean values that can contain many hundreds of
 millions of elements for which I want to count the number of True values
 in a sub-sequence, one from the start up to some value (say hi).

 I am currently using:

sieve[:hi].count(True)

 but I believe this may be costly because it copies a possibly large part
 of the sieve.
[snip]

 But when using a sub-sequence, I do suffer a significant reduction in
 speed for a count when compared with count on the full list.  When the
 list is small enough not to cause memory allocation issues this is about
 30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
 memory allocation becomes an issue and the cost on my system rises to
 over 600%.

Have you tried using numpy? I find that it reduces the memory required
to store a list of bools by a factor of 4 on my 32 bit system. I would
expect that to be a factor of 8 on a 64 bit system:

>>> import sys
>>> a = [True] * 100
>>> sys.getsizeof(a)
436
>>> import numpy
>>> a = numpy.ndarray(100, bool)
>>> sys.getsizeof(a)  # This does not include the data buffer
40
>>> a.nbytes
100

The numpy array also has the advantage that slicing does not actually
copy the data (as has already been mentioned). On this system slicing
a numpy array has a 40 byte overhead regardless of the size of the
slice.

 I agree that this is not a big issue but it seems to me a high price to
 pay for the lack of a sieve.count(value, limit), which I feel is a
 useful function (given that memoryview operations are not available for
 lists).

It would be very easy to subclass list and add this functionality in
cython if you decide that you do need a builtin method.
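
In pure Python the interface might look like the sketch below (count_to is
a name I have made up); the point of a cython version would be to turn the
counting loop into a C loop over the list's internal array:

    from itertools import islice

    class Sieve(list):
        def count_to(self, value, hi):
            # count occurrences of value in self[:hi] without copying
            return sum(1 for x in islice(self, hi) if x == value)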


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Blind Anagram
On 22/04/2013 16:14, Oscar Benjamin wrote:

 On 22 April 2013 15:15, Blind Anagram blindanag...@nowhere.org wrote:
 On 22/04/2013 14:13, Steven D'Aprano wrote:
 On Mon, 22 Apr 2013 12:58:20 +0100, Blind Anagram wrote:

 I would be grateful for any advice people can offer on the fastest way
 to count items in a sub-sequence of a large list.

 I have a list of boolean values that can contain many hundreds of
 millions of elements for which I want to count the number of True values
 in a sub-sequence, one from the start up to some value (say hi).

 I am currently using:

sieve[:hi].count(True)

 but I believe this may be costly because it copies a possibly large part
 of the sieve.
 [snip]

 But when using a sub-sequence, I do suffer a significant reduction in
 speed for a count when compared with count on the full list.  When the
 list is small enough not to cause memory allocation issues this is about
 30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
 memory allocation becomes an issue and the cost on my system rises to
 over 600%.
 
 Have you tried using numpy? I find that it reduces the memory required
 to store a list of bools by a factor of 4 on my 32 bit system. I would
 expect that to be a factor of 8 on a 64 bit system:
 
 >>> import sys
 >>> a = [True] * 1000000
 >>> sys.getsizeof(a)
 4000036
 >>> import numpy
 >>> a = numpy.ndarray(1000000, bool)
 >>> sys.getsizeof(a)  # This does not include the data buffer
 40
 >>> a.nbytes
 1000000
 
 The numpy array also has the advantage that slicing does not actually
 copy the data (as has already been mentioned). On this system slicing
 a numpy array has a 40 byte overhead regardless of the size of the
 slice.
 
 I agree that this is not a big issue but it seems to me a high price to
 pay for the lack of a sieve.count(value, limit), which I feel is a
 useful function (given that memoryview operations are not available for
 lists).
 
 It would be very easy to subclass list and add this functionality in
 cython if you decide that you do need a builtin method.

Thanks Oscar, I'll take a look at this.

But I was really wondering if there was a simple solution that worked
without people having to add libraries to their basic Python installations.

As I have never tried building an extension with cython, I am inclined
to try this as a learning exercise if nothing else.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Oscar Benjamin
On 22 April 2013 16:50, Blind Anagram blindanag...@nowhere.org wrote:

 It would be very easy to subclass list and add this functionality in
 cython if you decide that you do need a builtin method.
[snip]

 But I was really wondering if there was a simple solution that worked
 without people having to add libraries to their basic Python installations.

There are simple solutions and some have already been listed. You are
attempting to push your program to the limit of your hardware
capabilities and it's natural that in a high-level language you'll
often want special libraries for that.

I don't know what your application is but I would say that my first
port of call here would be to consider a different algorithmic
approach. An obvious question would be about the sparsity of this data
structure. How frequent are the values that you are trying to count?
Would it make more sense to store a list of their indices?

If the problem needs to be solved the way that you are currently doing
it and the available methods are not fast enough then you will need to
consider additional libraries.


 As I have never tried building an extension with cython, I am inclined
 to try this as a learning exercise if nothing else.

I definitely recommend this over writing a C extension directly.


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Blind Anagram
On 22/04/2013 17:06, Oscar Benjamin wrote:
 On 22 April 2013 16:50, Blind Anagram blindanag...@nowhere.org wrote:

 It would be very easy to subclass list and add this functionality in
 cython if you decide that you do need a builtin method.
 [snip]

 But I was really wondering if there was a simple solution that worked
 without people having to add libraries to their basic Python installations.
 
 There are simple solutions and some have already been listed. You are
 attempting to push your program to the limit of your hardware
 capabilities and it's natural that in a high-level language you'll
 often want special libraries for that.

Hi Oscar

Yes, but it is a tribute to Python that I can do this quite fast for
huge lists provided that I only count on the full list.

And, unless I have completely misunderstood Python internals, it would
probably be just as fast on a sub-sequence if I had a list.count(value,
limit) function.  (I admit that I could be wrong here: the fact that
count on lists does not offer this may mean that it is not as easy to
implement as it might seem.)

 I don't know what your application is but I would say that my first
 port of call here would be to consider a different algorithmic
 approach. An obvious question would be about the sparsity of this data
 structure. How frequent are the values that you are trying to count?
 Would it make more sense to store a list of their indices?

Actually it is no more than a simple prime sieve implemented as a Python
class (and, yes, I realize that there are plenty of these around).

 If the problem needs to be solved the way that you are currently doing
 it and the available methods are not fast enough then you will need to
 consider additional libraries.

 As I have never tried building an extension with cython, I am inclined
 to try this as a learning exercise if nothing else.
 
 I definitely recommend this over writing a C extension directly.

Thanks again - I will definitely look at this.

   Brian

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Skip Montanaro
 But I was really wondering if there was a simple solution that worked
 without people having to add libraries to their basic Python installations.

I think installing numpy is approximately

pip install numpy

assuming you have write access to your site-packages directory.  If
not, install using --prefix and set PYTHONPATH accordingly.

I forgot that Python also has an array module.  With numpy available,
mature, and well-supported, I imagine it doesn't get much love these
days though.  Still, I gave it a whirl:

###
import random
import array
from timeit import Timer

import numpy

stuff = [random.random() < 0.5 for i in range(10**7)]
sieve1 = numpy.array(stuff, dtype=bool)
sieve2 = array.array('B', stuff)

setup = """from __main__ import sieve1, sieve2
from itertools import islice
hi = 7*10**6
"""

t1 = Timer("(True == sieve1[:hi]).sum()", setup)
t2 = Timer("sieve2[:hi].count(True)", setup)
# t3 = Timer("sum(islice(sieve, hi))", setup)
# t4 = Timer("sum(x for x in islice(sieve, hi) if x)", setup)
# t5 = Timer("sum(x for x in islice(sieve, hi) if x is True)", setup)
# t6 = Timer("sum(1 for x in islice(sieve, hi) if x is True)", setup)
# t7 = Timer("len(list(filter(None, islice(sieve, hi))))", setup)

print(min(t1.repeat(number=10)))
print(min(t2.repeat(number=10)))
# for t in (t1, t2, t3, t4, t5, t6, t7):
# print( min(t.repeat(number=10)) )
###

Performance was not all that impressive:

0.340315103531
5.42102503777

Still, you might fiddle around with it a bit.  Perhaps unsigned ints
instead of unsigned bytes will provide more efficient counting...

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Blind Anagram
On 22/04/2013 18:48, Skip Montanaro wrote:
 But I was really wondering if there was a simple solution that worked
 without people having to add libraries to their basic Python installations.
 
 I think installing numpy is approximately
 
 pip install numpy
 
 assuming you have write access to your site-packages directory.  If
 not, install using --prefix and set PYTHONPATH accordingly.
 
 I forgot that Python also has an array module.  With numpy available,
 mature, and well-supported, I imagine it doesn't get much love these
 days though.  Still, I gave it a whirl:
 
 ###
 import random
 import array
 from timeit import Timer
 
 import numpy
 
 stuff = [random.random() < 0.5 for i in range(10**7)]
 sieve1 = numpy.array(stuff, dtype=bool)
 sieve2 = array.array('B', stuff)
 
 setup = """from __main__ import sieve1, sieve2
 from itertools import islice
 hi = 7*10**6
 """
 
 t1 = Timer("(True == sieve1[:hi]).sum()", setup)
 t2 = Timer("sieve2[:hi].count(True)", setup)
 # t3 = Timer("sum(islice(sieve, hi))", setup)
 # t4 = Timer("sum(x for x in islice(sieve, hi) if x)", setup)
 # t5 = Timer("sum(x for x in islice(sieve, hi) if x is True)", setup)
 # t6 = Timer("sum(1 for x in islice(sieve, hi) if x is True)", setup)
 # t7 = Timer("len(list(filter(None, islice(sieve, hi))))", setup)
 
 print(min(t1.repeat(number=10)))
 print(min(t2.repeat(number=10)))
 # for t in (t1, t2, t3, t4, t5, t6, t7):
 # print( min(t.repeat(number=10)) )
 ###
 
 Performance was not all that impressive:
 
 0.340315103531
 5.42102503777
 
 Still, you might fiddle around with it a bit.  Perhaps unsigned ints
 instead of unsigned bytes will provide more efficient counting...

I spent a lot of time comparing python arrays and lists but found that
lists were always much faster in this application.

I do have numpy installed but I remember that when I did this (some time
ago) it was far from easy with Python 3.x running natively on Windows x64.

  Brian

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Oscar Benjamin
On 22 April 2013 17:38, Blind Anagram blindanag...@nowhere.org wrote:
 On 22/04/2013 17:06, Oscar Benjamin wrote:

 I don't know what your application is but I would say that my first
 port of call here would be to consider a different algorithmic
 approach. An obvious question would be about the sparsity of this data
 structure. How frequent are the values that you are trying to count?
 Would it make more sense to store a list of their indices?

 Actually it is no more than a simple prime sieve implemented as a Python
 class (and, yes, I realize that there are plenty of these around).

If I understand correctly, you have a list of roughly a billion
True/False values indicating which integers are prime and which are
not. You would like to discover how many prime numbers there are
between two numbers a and b. You currently do this by counting the
number of True values in your list between the indices a and b.

If my description is correct then I would definitely consider using a
different algorithmic approach. The density of primes from 1 to 1
billion is about 5%. Storing the prime numbers themselves in a sorted
list would save memory and allow a potentially more efficient way of
counting the number of primes within some interval.

To see how it saves memory (on a 64 bit system):

$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> a = ([True] + [False]*19) * 50000
>>> len(a)
1000000
>>> sys.getsizeof(a)
8000072
>>> a = list(range(50000))
>>> sys.getsizeof(a)
450120
>>> sum(sys.getsizeof(x) for x in a)
1200000

So you're using about 1/5th of the memory with a list of primes
compared to a list of True/False values. Further savings would be
possible if you used an array to store the primes as 64 bit integers.
In this case it would take about 400MB to store all the primes up to 1
billion.

The more efficient way of counting the primes would then be to use the
bisect module. This gives you a way of counting the primes between a
and b with a cost that is logarithmic in the total number of primes
stored rather than linear in the size of the range (e.g. b-a). For
large enough primes/ranges this is certain to be faster. Whether it
actually works that way for your numbers I can't say.
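
In outline (a sketch, assuming primes is the sorted list of the primes
themselves):

    from bisect import bisect_left, bisect_right

    def count_between(primes, a, b):
        # number of primes p with a <= p <= b, in O(log(len(primes)))
        return bisect_right(primes, b) - bisect_left(primes, a)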


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Oscar Benjamin
On 22 April 2013 21:18, Oscar Benjamin oscar.j.benja...@gmail.com wrote:
 On 22 April 2013 17:38, Blind Anagram blindanag...@nowhere.org wrote:
 On 22/04/2013 17:06, Oscar Benjamin wrote:

 I don't know what your application is but I would say that my first
 port of call here would be to consider a different algorithmic
 approach. An obvious question would be about the sparsity of this data
 structure. How frequent are the values that you are trying to count?
 Would it make more sense to store a list of their indices?

 Actually it is no more than a simple prime sieve implemented as a Python
 class (and, yes, I realize that there are plenty of these around).

 If I understand correctly, you have a list of roughly a billion
 True/False values indicating which integers are prime and which are
 not. You would like to discover how many prime numbers there are
 between two numbers a and b. You currently do this by counting the
 number of True values in your list between the indices a and b.

 If my description is correct then I would definitely consider using a
 different algorithmic approach. The density of primes from 1 to 1
 billion is about 5%. Storing the prime numbers themselves in a sorted
 list would save memory and allow a potentially more efficient way of
 counting the number of primes within some interval.

In fact it is probably quicker if you don't mind using all that memory
to just store the cumulative sum of your prime True/False indicator
list. This would be the prime counting function pi(n). You can then
count the primes between a and b in constant time with pi[b] - pi[a].
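
Something like this sketch (itertools.accumulate needs Python 3.2+; sieve
is the True/False list indexed by the integers themselves):

    from itertools import accumulate

    pi = list(accumulate(sieve))   # pi[n] == number of primes <= n
    # primes p with a < p <= b:  pi[b] - pi[a]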


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Blind Anagram
On 22/04/2013 21:18, Oscar Benjamin wrote:
 On 22 April 2013 17:38, Blind Anagram blindanag...@nowhere.org wrote:

[snip]

 If my description is correct then I would definitely consider using a
 different algorithmic approach. The density of primes from 1 to 1
 billion is about 5%. Storing the prime numbers themselves in a sorted
 list would save memory and allow a potentially more efficient way of
 counting the number of primes within some interval.

That is correct but I need to say that the lengths I have been
describing are limiting cases - almost all of the time the sieve length
will be quite small.

But I was still interested to see if I could push the limit without
changing the essential simplicity of the sieve.

And here the cost of creating the slice (which I have measured) set me
wondering why a list.count(value, limit) function did not exist.

I also wondered whether I had missed any obvious way of avoiding the
slicing cost (intellectually it seemed wrong to me to have to copy the
list in order to count items within it).
[snip]

 
 So you're using about 1/5th of the memory with a list of primes
 compared to a list of True/False values. Further savings would be
 possible if you used an array to store the primes as 64 bit integers.
 In this case it would take about 400MB to store all the primes up to 1
 billion.

I have looked at solutions based on listing primes and here I have found
that they are very much slower than my existing solution when the sieve
is not large (which is the majority use case).

I have also tried counting using a loop such as:

  while i < limit:
      i = sieve.index(1, i) + 1
      cnt += 1

but this is slower than count even on huge lists.

Thank you again for your advice.

   Brian



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Blind Anagram
On 22/04/2013 22:03, Oscar Benjamin wrote:
 On 22 April 2013 21:18, Oscar Benjamin oscar.j.benja...@gmail.com wrote:
 On 22 April 2013 17:38, Blind Anagram blindanag...@nowhere.org wrote:
 On 22/04/2013 17:06, Oscar Benjamin wrote:

 I don't know what your application is but I would say that my first
 port of call here would be to consider a different algorithmic
 approach. An obvious question would be about the sparsity of this data
 structure. How frequent are the values that you are trying to count?
 Would it make more sense to store a list of their indices?

 Actually it is no more than a simple prime sieve implemented as a Python
 class (and, yes, I realize that there are plenty of these around).

 If I understand correctly, you have a list of roughly a billion
 True/False values indicating which integers are prime and which are
 not. You would like to discover how many prime numbers there are
 between two numbers a and b. You currently do this by counting the
 number of True values in your list between the indices a and b.

 If my description is correct then I would definitely consider using a
 different algorithmic approach. The density of primes from 1 to 1
 billion is about 5%. Storing the prime numbers themselves in a sorted
 list would save memory and allow a potentially more efficient way of
 counting the number of primes within some interval.
 
 In fact it is probably quicker if you don't mind using all that memory
 to just store the cumulative sum of your prime True/False indicator
 list. This would be the prime counting function pi(n). You can then
 count the primes between a and b in constant time with pi[b] - pi[a].

I did wonder whether, after creating the sieve, I should simply go
through the list and replace the True values with a count.  This would
certainly speed up the prime count function, which is where the issue
arises.  I will try this and see what sort of performance trade-offs
this involves.
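
Probably something along these lines (an untested sketch):

    total = 0
    for i, flag in enumerate(sieve):
        if flag:
            total += 1
        sieve[i] = total   # then sieve[b] - sieve[a] counts primes in (a, b]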

  Brian

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Steven D'Aprano
On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote:

 But when using a sub-sequence, I do suffer a significant reduction in
 speed for a count when compared with count on the full list.  When the
 list is small enough not to cause memory allocation issues this is about
 30% on 100,000,000 items.  But when the list is 1,000,000,000 items, OS
 memory allocation becomes an issue and the cost on my system rises to
 over 600%.

Buy more memory :-)


 I agree that this is not a big issue but it seems to me a high price to
 pay for the lack of a sieve.count(value, limit), which I feel is a
 useful function (given that memoryview operations are not available for
 lists).

There is no need to complicate the count method for such a specialised 
use-case. A more general solution would be to provide list views. 


Another solution might be to use arrays rather than lists. Since your 
sieve list is homogeneous, you could possibly use an array of 1 or 0 
bytes rather than a list of True or False bools. That would reduce the 
memory overhead by a factor of four, and similarly reduce the overhead of 
any copying:


py from array import array
py from sys import getsizeof
py L = [True, False, False, True]*1000
py A = array('b', L)
py getsizeof(L)
16032
py getsizeof(A)
4032



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Oscar Benjamin
On 22 April 2013 22:25, Blind Anagram blindanag...@nowhere.org wrote:
 On 22/04/2013 21:18, Oscar Benjamin wrote:
 On 22 April 2013 17:38, Blind Anagram blindanag...@nowhere.org wrote:

 I also wondered whether I had missed any obvious way of avoiding the
 slicing cost (intellectually it seemed wrong to me to have to copy the
 list in order to count items within it).
 [snip]

 I have looked at solutions based on listing primes and here I have found
 that they are very much slower than my existing solution when the sieve
 is not large (which is the majority use case).

What matters is not so much the size of the sieve but the size of the
interval you want to query. You say that slicing cost is somehow
significant which suggests to me that it's not a small interval. An
approach using a sorted list of primes and bisect would have a cost
that is independent of the size of the interval (and depends only
logarithmically on the size of the sieve).


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Steven D'Aprano
On Mon, 22 Apr 2013 22:25:50 +0100, Blind Anagram wrote:

 I have looked at solutions based on listing primes and here I have found
 that they are very much slower than my existing solution when the sieve
 is not large (which is the majority use case).

Yes. This is hardly surprising. Algorithms suitable for dealing with the 
first million primes are not suitable for dealing with the first trillion 
primes, and vice versa. We like to pretend that computer programming is 
an abstraction, and for small enough data we often can get away with 
that, but like all abstractions eventually it breaks and the cost of 
dealing with real hardware becomes significant.

But I must ask, given that the primes are so widely distributed, why are 
you storing them in a list instead of a sparse array (i.e. a dict)? There 
are 50,847,534 primes less than or equal to 1,000,000,000, so you are 
storing roughly 18 False values for every True value. That ratio will 
only get bigger. With a billion entries, you are using 18 times more 
memory than necessary.
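
For example, with a toy sieve (a set is enough if you only need
membership; a dict could map each prime to extra data):

py sieve = [False, False, True, True, False, True, False, True]
py primes = {i for i, flag in enumerate(sieve) if flag}
py sorted(primes)
[2, 3, 5, 7]
py 5 in primes
True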



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: List Count

2013-04-22 Thread Dave Angel

On 04/22/2013 05:32 PM, Blind Anagram wrote:

On 22/04/2013 22:03, Oscar Benjamin wrote:

On 22 April 2013 21:18, Oscar Benjamin oscar.j.benja...@gmail.com wrote:

On 22 April 2013 17:38, Blind Anagram blindanag...@nowhere.org wrote:

On 22/04/2013 17:06, Oscar Benjamin wrote:


I don't know what your application is but I would say that my first
port of call here would be to consider a different algorithmic
approach. An obvious question would be about the sparsity of this data
structure. How frequent are the values that you are trying to count?
Would it make more sense to store a list of their indices?


Actually it is no more than a simple prime sieve implemented as a Python
class (and, yes, I realize that there are plenty of these around).


If I understand correctly, you have a list of roughly a billion
True/False values indicating which integers are prime and which are
not. You would like to discover how many prime numbers there are
between two numbers a and b. You currently do this by counting the
number of True values in your list between the indices a and b.

If my description is correct then I would definitely consider using a
different algorithmic approach. The density of primes from 1 to 1
billion is about 5%. Storing the prime numbers themselves in a sorted
list would save memory and allow a potentially more efficient way of
counting the number of primes within some interval.


In fact it is probably quicker if you don't mind using all that memory
to just store the cumulative sum of your prime True/False indicator
list. This would be the prime counting function pi(n). You can then
count the primes between a and b in constant time with pi[b] - pi[a].


I did wonder whether, after creating the sieve, I should simply go
through the list and replace the True values with a count.  This would
certainly speed up the prime count function, which is where the issue
arises.  I will try this and see what sort of performance trade-offs
this involves.



By doing that replacement, you'd increase memory usage manyfold (maybe 
3:1, I don't know).  As long as you're only using bools in the list, you 
only have the list overhead to consider, because all the objects 
involved are already cached (True and False exist only once each).  If 
you have integers, you'll need a new object for each nonzero count.




--
DaveA
--
http://mail.python.org/mailman/listinfo/python-list