[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Joshua Bronson

New submission from Joshua Bronson :

>From http://docs.python.org/library/heapq.html:
> The latter two functions (nlargest and nsmallest) perform best for
> smaller values of n. For larger values, it is more efficient to use
> the sorted() function. Also, when n==1, it is more efficient to use
> the builtin min() and max() functions.

A quick usability win for the heapq module: Be more specific than "smaller" and 
"larger", e.g. "when n is O(...(len(iterable)))".

>From http://groups.google.com/group/comp.lang.python/msg/9556f56ae25ecf3b:
> I find it interesting that the heapq functions tell you in the
> documentation that they aren't suitable for use where n==1 or where
> n is near the total size of the sequence whereas random.sample()
> chooses what it thinks is the best algorithm based on the input. At
> the very least I would have thought the heapq functions could check
> for n==1 and call min/max when it is.

+1. It looks like the pure Python implementation of nsmallest is actually 
already 
choosing either an insort algorithm or a minheap algorithm based on whether n * 
10 <= 
len(iterable), but the C implementation of nsmallest unconditionally uses a 
minheap 
algorithm. Neither the pure Python nor the C implementation of nlargest chooses 
its 
algorithm conditionally. For the sake of consistency and usability, all 
implementations of nsmallest and nlargest should choose the most efficient 
algorithm 
from among min/max, insort, and minheap, and the docs should be updated to 
describe 
this behavior.

Also, in looking through the heapq.py that came with my Python 2.6.2 
distribution, I 
noticed that line 196 seems to have no reason to be there:
_heappushpop = heappushpop  
This is the only occurrence of "_heappushpop" in the file.

I made a patch for heapq.py, but since I'm not a CPython hacker, I thought it 
might 
be better to leave the changes up to someone who could do both the pure Python 
and 
the C implementations in tandem. I'd be happy to contribute documentation when 
the 
time comes if that would help.

--
messages: 91134
nosy: jab
severity: normal
status: open
title: heapq.nsmallest and nlargest should be smarter/more usable/more 
consistent
type: behavior
versions: Python 2.5, Python 2.6, Python 2.7, Python 3.0, Python 3.1

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

FWIW, 2.7 and 3.1 already have automatic selection of sort()/min()/max()
alternatives.  They use pure python to dispatch to the underlying C
functions:

http://svn.python.org/view/python/branches/release31-maint/Lib/heapq.py?revision=73579&view=markup

--
assignee:  -> rhettinger
nosy: +rhettinger
resolution:  -> out of date
status: open -> closed
type: behavior -> performance
versions: +Python 3.2 -Python 2.5, Python 2.6, Python 3.0, Python 3.1

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Joshua Bronson

Joshua Bronson  added the comment:

Oh, that's great!

(I also noticed that the previously inutile line "_heappushpop = heappushpop" 
is now doing something in the heapq.py you linked to, nice.)

It looks like the docs haven't been updated yet though. For instance, 
http://docs.python.org/3.1/library/heapq.html still says "The latter two 
functions perform best for smaller values of n. For larger values, it is more 
efficient to use the sorted() function. Also, when n==1, it is more efficient 
to use the builtin min() and max() functions."

Also, I notice the pure Python implementation of nsmallest still does that 
check to see if n * 10 <= len(iterable), and if so uses an insort-based 
algorithm. It looks like this is still undocumented, inconsistent with the C 
implementation, and asymmetrical to the implementations of nlargest. If this 
branch is remaining there on purpose, I'd love see its existence mentioned and 
its theoretical basis explained in the docs, along with any similar branches 
implemented elsewhere in the code, if they should be.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

I prefer the docs the way they are.  They help the reader understand the
relationship between min, max, nsmallest, nlargest, and sorted.

I'm not sure where you got the n * 10 <= len(iterable) switch-over
point.  That is arbitrary.  The correct switchover point depends on the
cost of the comparison function, whether the length of the input is
known, whether the input data is partially ordered, memory constraints,
whether a key function is used, and on other factors.   

FWIW, I also wrote the logic for random.sample().  The switchover logic
was straight-forward because performance depended on factors that were
fully known (length of input, sample size, memory utilization, and
average number of probes for each strategy).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

One other thought:  When memory is tight, the programmer needs to be
able to select the heap algorithm in favor of sorted() even for
relatively large values of n.  I do not want an automatic switchover
point that takes away a programmer's choice between speed and space
efficiency.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Joshua Bronson

Joshua Bronson  added the comment:

> I prefer the docs the way they are.  They help the reader understand
> the relationship between min, max, nsmallest, nlargest, and sorted.

Except that it's no longer true that "when n==1, it is more efficient to use 
the 
builtin min() and max() functions." Shouldn't this be updated to say "when 
n==1, 
it is equivalent to using the builtin min() and max() functions"?

> I'm not sure where you got the n * 10 <= len(iterable) switch-over
> point.

It's right in the file you linked to. Search for "n * 10" in 
http://svn.python.org/view/python/branches/release31-maint/Lib/heapq.py?
revision=73579&view=markup.

> That is arbitrary.  The correct switchover point depends on the
> cost of the comparison function, whether the length of the input is
> known, whether the input data is partially ordered, memory constraints,
> whether a key function is used, and on other factors.   

So should that be removed, then?

> FWIW, I also wrote the logic for random.sample().  The switchover logic
> was straight-forward because performance depended on factors that were
> fully known (length of input, sample size, memory utilization, and
> average number of probes for each strategy).
> One other thought:  When memory is tight, the programmer needs to be
> able to select the heap algorithm in favor of sorted() even for
> relatively large values of n.  I do not want an automatic switchover
> point that takes away a programmer's choice between speed and space
> efficiency.

Agreed, and thanks for the additional info.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Joshua Bronson

Joshua Bronson  added the comment:

One more thing:

> I prefer the docs the way they are.  They help the reader understand
> the relationship between min, max, nsmallest, nlargest, and sorted.

The docs still use the unspecific language "for smaller values of n" and 
"for larger values". I think careful readers would appreciate an addition 
along the lines of what you wrote earlier -- the optimal switchover point 
depends on the cost of the comparison function, the ordering of the input 
data, etc.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

> Except that it's no longer true that "when n==1, it is
> more efficient to use the builtin min() and max() functions."

There's still the dispatch overhead.
If someone needs a n==1 case, they
*should* use min/max for both speed
and clarity.

Also, it is important the the docs
communicate the relationship between
min/max, nlargest/nsmallest, and sorted.


> It's right in the file you linked to. Search for "n * 10" in ...

That is in the pure python version of nsmallest() and that
code is not used (it is overriden by the C version).  The
actual C implementation works differently -- it uses an
underlying maxheap so it can use a cleaner algorithm that
doesn't need switchover tricks.

If you feel like "kicking the ball around", please continue the
discussion on comp.lang.python -- I think we're done here.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Joshua Bronson

Joshua Bronson  added the comment:

> That is in the pure python version of nsmallest() and that
> code is not used (it is overriden by the C version).

So just because it isn't used by CPython it should remain in there even 
though as you said yourself it's completely without basis? What about non 
CPython? I'm not trying to "kick the ball around" here, we're both on the 
same team just trying to make Python better.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6614] heapq.nsmallest and nlargest should be smarter/more usable/more consistent

2009-07-31 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

There is a basis for the pure python version to switch to bisect.

There is not a basis for having the final wrapped C function switch to
using sorted().  That is a programmer decision.

The pure python version is there to show how it could be done using a
minheap and it is a bit of hack to get it to work at all.  The C version
uses a maxheap and does not need logic for switching to bisect.  It is
clean and fast.

I'm sorry, but I've lost interest in continuing this discussion.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com