Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-16 Thread Jesus Cea
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 13/03/11 13:14, Paul Moore wrote:
 None of my real code is affected either way, but it seems to me that
 the removal of the comparison function option was (sadly) a case of
 purity being allowed to beat practicality. Luckily, adding it back
 wouldn't be a backward compatibility issue :-) Whether it's worth
 doing, I'll leave to those who would be doing the work to judge...

I would +1 to reintroducing cmp. Mapping my mental cmp to a key
is something I must think about everyday. When I have a 200 element
list, I don't mind about the O(nlog(n)) nature of cmp, instead of
O(n) of key.

- -- 
Jesus Cea Avion _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
jabber / xmpp:j...@jabber.org _/_/_/_/  _/_/_/_/_/
.  _/_/  _/_/_/_/  _/_/  _/_/
Things are not so easy  _/_/  _/_/_/_/  _/_/_/_/  _/_/
My name is Dump, Core Dump   _/_/_/_/_/_/  _/_/  _/_/
El amor es poner tu felicidad en la felicidad de otro - Leibniz
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQCVAwUBTYDhJJlgi5GaxT1NAQKcXQP+IX5bSb3aSNsfzu/vjNY41CNI/1CJ+r3q
vxPbi7ut1ZvXVyiMo2b+jVC1rxNVitf0pQXyQ4skv9Tiq3+L8eYbOts7hbmDB7Aw
sNPWYuvjLmTrX2lsgdQyjbiVK0lHJjzYn8+pvKqy7Tt47+sNiL/07FmM6CHj2fiU
N/vspNCa6rs=
=Zc90
-END PGP SIGNATURE-
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-13 Thread Paul Moore
On 13 March 2011 03:00, Raymond Hettinger raymond.hettin...@gmail.com wrote:
 But in Python 3 this solution is no longer available. How bad is that?
 I'm not sure. But I'd like to at least get the issue out in the open.


 Python3.2 should be substantially better in this regard.
 It no longer wraps key objects around every input.  Instead, it
 sorts two parallel arrays of pointers.   You can thank Daniel
 Stutzbach (another Googler I believe) for this effort.

For what it's worth:

PS D:\Data D:\Apps\Python27\python.exe -m timeit -s L = [(1,2),
(3,4), (0,5), (9,100), (3,7), (2,8)] sorted(L, lambda (p,q),(r,s):
cmp(p*s, q*r))
10 loops, best of 3: 5.19 usec per loop
PS D:\Data D:\Apps\Python27\python.exe -m timeit -s L = [(1,2),
(3,4), (0,5), (9,100), (3,7), (2,8)] -s from fractions import
Fraction sorted(L, key=lambda t: Fraction(*t))
1 loops, best of 3: 51.6 usec per loop
PS D:\Data D:\Apps\Python27\python.exe -m timeit -s L = [(1,2),
(3,4), (0,5), (9,100), (3,7), (2,8)] -s from fcomp import Fcomp
sorted(L, key=Fcomp)
10 loops, best of 3: 8.6 usec per loop
PS D:\Data D:\Apps\Python32\python.exe -m timeit -s L = [(1,2),
(3,4), (0,5), (9,100), (3,7), (2,8)] -s from fractions import
Fraction sorted(L, key=lambda t: Fraction(*t))
1 loops, best of 3: 64.4 usec per loop
PS D:\Data D:\Apps\Python32\python.exe -m timeit -s L = [(1,2),
(3,4), (0,5), (9,100), (3,7), (2,8)] -s from fcomp import Fcomp
sorted(L, key=Fcomp)
10 loops, best of 3: 8.41 usec per loop

This says to me that using Fraction as a key is substantially worse
(and worse still in 3.2 over 2.7). Using a custom key is closer to a
comparison function (but still 60-70% slower) and is marginally faster
in 3.2. Clearly, the nature of the key is critical here, so take this
with even more of a pinch of salt than you normally would with a
benchmark.

None of my real code is affected either way, but it seems to me that
the removal of the comparison function option was (sadly) a case of
purity being allowed to beat practicality. Luckily, adding it back
wouldn't be a backward compatibility issue :-) Whether it's worth
doing, I'll leave to those who would be doing the work to judge...

Paul.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-13 Thread Daniel Stutzbach
On Sat, Mar 12, 2011 at 3:44 PM, Guido van Rossum gu...@python.org wrote:

 I recently advised a Googler who was sorting a large dataset and
 running out of memory. My analysis of the situation was that he was
 sorting a huge list of short lines of the form shortstring,integer
 with a key function that returned a tuple of the form (shortstring,
 integer).


As Raymond pointed out, a change I made for 3.2 significantly shrinks the
memory footprint of sorting with a key (although it's still more
memory-intensive than sorting with cmp).

He could reduce the memory footprint further by sorting in two passes
instead of using a tuple, leveraging the fact that Python guarantees a
stable sort.  In 3.2 or later, this technique will require roughly twice as
much memory as just storing the list:

biglist.sort(key=lambda s: int(s.split(',')[1]))  # Sort by the integer
biglist.sort(key=lambda s: s.split(',')[0])  # Sort by the shortstring

I think the use cases are pretty narrow where there's plenty of memory for
storing the list but not enough to store two copies.

-- 
Daniel Stutzbach
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-13 Thread Daniel Stutzbach
On Sat, Mar 12, 2011 at 9:17 PM, Terry Reedy tjre...@udel.edu wrote:

 But in this case, they are much slower. To be faster, one would need
 something like key=lambda p,q:p*(lcm//q), where lcm is the least common
 multiple of of all the q's (denominators). For the example above, lcm = 700.
 But for longer lists, it could be outrageously large.


You can get away with less precision.  It's okay if the key function loses
some information, as long as it still has enough to distinguish each pair of
numbers.  In other words, you just need a number that's at least as large as
the largest lcm between any pair:

max_denominator_sq = max(q for p, q in fraction_list) ** 2  # Strictly
larger than the lcm between any pair
fraction_list.sort(key=lambda p, q: p*max_denominator_sq//q)

Using the above, key is 4 times faster than cmp in Python2.6:


stutzbach-glaptop:~$ python2.6 -m timeit -s 'fractions = [(i, j) for i in
range(100) for j in range(1, 100)]'  'sorted(fractions, cmp=lambda
(p,q),(r,s): cmp(p*s, q*r))'
10 loops, best of 3: 23.3 msec per loop

stutzbach-glaptop:~$ python2.6 -m timeit -s 'fractions = [(i, j) for i in
range(100) for j in range(1, 100)]' 'max_denominator_sq = max(q for p, q in
fractions)**2' (1,2), (3,4), (0,5), (9,100), (3,7), (2,8)'sorted(fractions,
key=lambda t: t[0]*max_denominator_sq//t[1])'
100 loops, best of 3: 5.52 msec per loop


with a further speed improvement in 3.2:


stutzbach-glaptop:~$ ~/python/cpython-3.2/python -m timeit -s 'fractions =
[(i, j) for i in range(100) for j in range(1, 100)]' 'max_denominator_sq =
max(q for p, q in fractions)**2' 'sorted(fractions, key=lambda t:
t[0]*max_denominator_sq//t[1])'
100 loops, best of 3: 4.89 msec per loop


and more speed improvements with my experimental changes targeting 3.3 (not
yet in the bug tracker):   :-)


stutzbach-glaptop:~$ ~/python/cpython/python -m timeit -s 'fractions = [(i,
j) for i in range(100) for j in range(1, 100)]' 'max_denominator_sq = max(q
for p, q in fractions)**2' 'sorted(fractions, key=lambda t:
t[0]*max_denominator_sq//t[1])'
100 loops, best of 3: 3.73 msec per loop

-- 
Daniel Stutzbach
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-13 Thread Terry Reedy

On 3/13/2011 2:05 PM, Daniel Stutzbach wrote:

On Sat, Mar 12, 2011 at 3:44 PM, Guido van Rossum gu...@python.org
mailto:gu...@python.org wrote:

I recently advised a Googler who was sorting a large dataset and
running out of memory. My analysis of the situation was that he was
sorting a huge list of short lines of the form shortstring,integer
with a key function that returned a tuple of the form (shortstring,
integer).


As Raymond pointed out, a change I made for 3.2 significantly shrinks
the memory footprint of sorting with a key (although it's still more
memory-intensive than sorting with cmp).

He could reduce the memory footprint further by sorting in two passes
instead of using a tuple, leveraging the fact that Python guarantees a
stable sort.  In 3.2 or later, this technique will require roughly twice
as much memory as just storing the list:

biglist.sort(key=lambda s: int(s.split(',')[1]))  # Sort by the integer
biglist.sort(key=lambda s: s.split(',')[0])  # Sort by the shortstring

I think the use cases are pretty narrow where there's plenty of memory
for storing the list but not enough to store two copies.


Here are two variations on the idea of two pass sorting, both of which 
only sort ints for duplicate shortstrings and neither of which use key=.


1. If duplication of shortstrings is (known to be) sparse, sort the 
lines as they are, then scan to detect runs (slices) of duplicate 
shortstrings and sort and replace those. (If sort had start and stop 
index keywords, this could be done in place.)


2. If duplication of shortstrings is (known to be) heavy, read the 
dataset into a shortstring-list_of_ints dict rathar than a list. Sort 
the keys in a separate (much shorted) list and sort each value (list of 
ints) separately. For some post-sort usages, this would be more useful 
than a flat sorted list.


A third idea is to presort the dataset into chunks (a-m, n-z) or however 
appropriate, perhaps as it is gathered; sort each separately; and then 
concatenate. This will handle cases where even the flat list is too big 
for memory.


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Reid Kleckner
They should be able to use a slotted cmp_to_key style class:
http://docs.python.org/howto/sorting.html

That will allocate 1 Python object with no dict per key, but that
might not be good enough.

Reid

On Sat, Mar 12, 2011 at 3:44 PM, Guido van Rossum gu...@python.org wrote:
 I was just reminded that in Python 3, list.sort() and sorted() no
 longer support the cmp (comparator) function argument. The reason is
 that the key function argument is always better. But now I have a
 nagging doubt about this:

 I recently advised a Googler who was sorting a large dataset and
 running out of memory. My analysis of the situation was that he was
 sorting a huge list of short lines of the form shortstring,integer
 with a key function that returned a tuple of the form (shortstring,
 integer). Using the key function argument, in addition to N short
 string objects, this creates N tuples of length 2, N more slightly
 shorter string objects, and N integer objects. (Not to count a
 parallel array of N more pointers.) Given the object overhead, this
 dramatically increased the memory usage. It so happens that in this
 particular Googler's situation, memory is constrained but CPU time is
 not, and it would be better to parse the strings over and over again
 in a comparator function.

 But in Python 3 this solution is no longer available. How bad is that?
 I'm not sure. But I'd like to at least get the issue out in the open.

 --
 --Guido van Rossum (python.org/~guido)
 ___
 Python-Dev mailing list
 Python-Dev@python.org
 http://mail.python.org/mailman/listinfo/python-dev
 Unsubscribe: 
 http://mail.python.org/mailman/options/python-dev/reid.kleckner%40gmail.com

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Fredrik Johansson
On Sat, Mar 12, 2011 at 9:44 PM, Guido van Rossum gu...@python.org wrote:
 I was just reminded that in Python 3, list.sort() and sorted() no
 longer support the cmp (comparator) function argument. The reason is
 that the key function argument is always better. But now I have a
 nagging doubt about this:

 I recently advised a Googler who was sorting a large dataset and
 running out of memory. My analysis of the situation was that he was
 sorting a huge list of short lines of the form shortstring,integer
 with a key function that returned a tuple of the form (shortstring,
 integer). Using the key function argument, in addition to N short
 string objects, this creates N tuples of length 2, N more slightly
 shorter string objects, and N integer objects. (Not to count a
 parallel array of N more pointers.) Given the object overhead, this
 dramatically increased the memory usage. It so happens that in this
 particular Googler's situation, memory is constrained but CPU time is
 not, and it would be better to parse the strings over and over again
 in a comparator function.

 But in Python 3 this solution is no longer available. How bad is that?
 I'm not sure. But I'd like to at least get the issue out in the open.

The removal of cmp and especially sort(cmp=...) was probably my least
favorite change in Python 3.

Sorting with a key only makes sense when the data can be mapped onto
builtin types (typically ints, strings, and tuples thereof) in an
order-preserving way. When this is not possible, you have to map it
onto a custom type and effectively implement a cmp function via the
comparison methods of that type. This is an ugly and slow substitute
for something that used to be elegant and fast.

Consider sorting a list of pairs representing fractions. This can be
done easily in Python 2.x with the comparison function lambda
(p,q),(r,s): cmp(p*s, q*r). In Python 2.6, this is about 40 times
faster than using fractions.Fraction as a key function.

Fredrik
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Nick Coghlan
On Sat, Mar 12, 2011 at 4:50 PM, Reid Kleckner reid.kleck...@gmail.com wrote:
 They should be able to use a slotted cmp_to_key style class:
 http://docs.python.org/howto/sorting.html

 That will allocate 1 Python object with no dict per key, but that
 might not be good enough.

Tuples are already slotted, so that isn't likely to help in this case.

The basic point is that cmp vs key is actually a classic space vs
speed trade-off, and by removing the cmp option, we prevent people
from making that trade-off for themselves.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Terry Reedy

On 3/12/2011 3:44 PM, Guido van Rossum wrote:

I was just reminded that in Python 3, list.sort() and sorted() no
longer support the cmp (comparator) function argument. The reason is
that the key function argument is always better. But now I have a
nagging doubt about this:

I recently advised a Googler who was sorting a large dataset and
running out of memory. My analysis of the situation was that he was
sorting a huge list of short lines of the form shortstring,integer
with a key function that returned a tuple of the form (shortstring,
integer).


I believe that if the integer field were padded with leading blanks as 
needed so that all are the same length, then no key would be needed.


ll = ['a,1', 'ab,3', 'a,1', 'a,  111']
ll.sort()
print(ll)

['a,1', 'a,  111', 'a,1', 'ab,3']

If most ints are near the max len, this would add little space, and be 
even faster than with the key.


 Using the key function argument, in addition to N short

string objects, this creates N tuples of length 2, N more slightly
shorter string objects, and N integer objects. (Not to count a
parallel array of N more pointers.) Given the object overhead, this
dramatically increased the memory usage. It so happens that in this
particular Googler's situation, memory is constrained but CPU time is
not, and it would be better to parse the strings over and over again
in a comparator function.


Was 3.2 used? It has a patch that reduces the extra memory that might 
not be in the last 3.1 release.



But in Python 3 this solution is no longer available. How bad is that?
I'm not sure. But I'd like to at least get the issue out in the open.


This removal has been one of the more contentious issues about (not) 
using 3.x. I believe Raymond had been more involved in the defense of 
the decision than I. However, the discussion/complaint has mostly been 
about the relative difficulty of writing a key function versus a compare 
function.


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Reid Kleckner
On Sat, Mar 12, 2011 at 4:58 PM, Nick Coghlan ncogh...@gmail.com wrote:
 On Sat, Mar 12, 2011 at 4:50 PM, Reid Kleckner reid.kleck...@gmail.com 
 wrote:
 They should be able to use a slotted cmp_to_key style class:
 http://docs.python.org/howto/sorting.html

 That will allocate 1 Python object with no dict per key, but that
 might not be good enough.

 Tuples are already slotted, so that isn't likely to help in this case.

It's three allocations vs. one.  The first is tuple + str + int, while
the adapter is just one object.  I'm not sure how it eventually shakes
out, though.

That said, it's still worse than Python 2, which is zero allocations.  :)

Reid
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Eugene Toder
Can sort have an option (and/or try to figure it itself) to calculate
key for every comparison instead of caching them? This will have the
same memory requirements as with cmp, but doesn't require rewriting
code if you decide to trade speed for memory. Will this be much slower
than with cmp?

If going that route sort can also cache a limited amount of keys
(instead of all or nothing), using, for example, a LRU cache with
fixed size.

Eugene
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Martin v. Löwis

Am 12.03.11 16:58, schrieb Nick Coghlan:

On Sat, Mar 12, 2011 at 4:50 PM, Reid Klecknerreid.kleck...@gmail.com  wrote:

They should be able to use a slotted cmp_to_key style class:
http://docs.python.org/howto/sorting.html

That will allocate 1 Python object with no dict per key, but that
might not be good enough.


Tuples are already slotted, so that isn't likely to help in this case.


Why not? IIUC, the current key function creates three objects: the 
tuple, the short string, and the int. With the class


class cmp_to_key:
  __slots__=['obj']
  def __init__(self, obj):
  self.obj = obj
  def __lt__(self):
  ...

you would only create a single object, so you save the string and the 
integer. In addition, on a 64-bit system, the size of a cmp_to_key

instance is 56 bytes, whereas a two-tuple is 72 bytes, so you also
save 16 bytes per object. Whether that would already create a sufficient
reduction for the case at hand, I don't know.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Peter Otten
Guido van Rossum wrote:

 I was just reminded that in Python 3, list.sort() and sorted() no
 longer support the cmp (comparator) function argument. The reason is
 that the key function argument is always better. But now I have a
 nagging doubt about this:
 
 I recently advised a Googler who was sorting a large dataset and
 running out of memory. My analysis of the situation was that he was
 sorting a huge list of short lines of the form shortstring,integer
 with a key function that returned a tuple of the form (shortstring,
 integer). Using the key function argument, in addition to N short
 string objects, this creates N tuples of length 2, N more slightly
 shorter string objects, and N integer objects. (Not to count a
 parallel array of N more pointers.) Given the object overhead, this
 dramatically increased the memory usage. It so happens that in this
 particular Googler's situation, memory is constrained but CPU time is
 not, and it would be better to parse the strings over and over again
 in a comparator function.
 
 But in Python 3 this solution is no longer available. How bad is that?
 I'm not sure. But I'd like to at least get the issue out in the open.

While there are other arguments to reintroduce cmp (or less_than instead?) 
the memory problem could also be addressed with a dont_cache_keys flag or 
max_cache_keys limit.

Peter

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Nick Coghlan
On Sat, Mar 12, 2011 at 5:41 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 Why not? IIUC, the current key function creates three objects: the tuple,
 the short string, and the int. With the class

Yeah, I misread the example. Using cmp_to_key would indeed save quite
a lot of memory in this case.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Martin v. Löwis

But in Python 3 this solution is no longer available. How bad is that?
I'm not sure. But I'd like to at least get the issue out in the open.


Rather than reintroducing cmp=, I'd add a cached=True parameter.
If this is set to False, the key results wouldn't be put into a
list, but recreated every time two objects need to be compared.

Regards,
Martin

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Glenn Linderman

On 3/12/2011 1:55 PM, Fredrik Johansson wrote:

Consider sorting a list of pairs representing fractions. This can be
done easily in Python 2.x with the comparison function lambda
(p,q),(r,s): cmp(p*s, q*r). In Python 2.6, this is about 40 times
faster than using fractions.Fraction as a key function.


Am I correct in concluding that various ideas to eliminate or limit the 
size of the key= cache would not help this use case at all?
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Martin v. Löwis

Am 12.03.11 18:00, schrieb Glenn Linderman:

  On 3/12/2011 1:55 PM, Fredrik Johansson wrote:

Consider sorting a list of pairs representing fractions. This can be
done easily in Python 2.x with the comparison function lambda
(p,q),(r,s): cmp(p*s, q*r). In Python 2.6, this is about 40 times
faster than using fractions.Fraction as a key function.


Am I correct in concluding that various ideas to eliminate or limit the
size of the key= cache would not help this use case at all?


That's correct. However, there is a straight-forward day of getting
the same comparison algorithm with the cmp_to_key class in 3.x.
Fredrik classified this as ugly and slow; I'm not sure where this
classification comes from.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Fredrik Johansson
On Sun, Mar 13, 2011 at 12:41 AM, Martin v. Löwis mar...@v.loewis.de wrote:
 Am 12.03.11 18:00, schrieb Glenn Linderman:

  On 3/12/2011 1:55 PM, Fredrik Johansson wrote:

 Consider sorting a list of pairs representing fractions. This can be
 done easily in Python 2.x with the comparison function lambda
 (p,q),(r,s): cmp(p*s, q*r). In Python 2.6, this is about 40 times
 faster than using fractions.Fraction as a key function.

 Am I correct in concluding that various ideas to eliminate or limit the
 size of the key= cache would not help this use case at all?

 That's correct. However, there is a straight-forward day of getting
 the same comparison algorithm with the cmp_to_key class in 3.x.
 Fredrik classified this as ugly and slow; I'm not sure where this
 classification comes from.

It's ugly because it involves creating a class wrapping a comparison
function, or importing some obscure magic from the standard library,
instead of just using a comparison function.

It's slow because it's slow (and less memory-efficient). Even with
cmp_to_key, the example with fractions is still 2.6 times slower than
with a direct cmp function.

Fredrik
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Terry Reedy

On 3/12/2011 5:09 PM, Reid Kleckner wrote:

On Sat, Mar 12, 2011 at 4:58 PM, Nick Coghlanncogh...@gmail.com  wrote:

On Sat, Mar 12, 2011 at 4:50 PM, Reid Klecknerreid.kleck...@gmail.com  wrote:

They should be able to use a slotted cmp_to_key style class:
http://docs.python.org/howto/sorting.html

That will allocate 1 Python object with no dict per key, but that
might not be good enough.


Tuples are already slotted, so that isn't likely to help in this case.


It's three allocations vs. one.  The first is tuple + str + int, while
the adapter is just one object.  I'm not sure how it eventually shakes
out, though.

That said, it's still worse than Python 2, which is zero allocations.  :)


And revising the data so that no key and no cmp function is needed is 
zero allocations and faster. See my other post.


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Steven D'Aprano

Fredrik Johansson wrote:


Consider sorting a list of pairs representing fractions. This can be
done easily in Python 2.x with the comparison function lambda
(p,q),(r,s): cmp(p*s, q*r). In Python 2.6, this is about 40 times
faster than using fractions.Fraction as a key function.



[steve@sylar ~]$ python2.7 -m timeit -s L = [(1,2), (3,4), (0,5), 
(9,100), (3,7), (2,8)] sorted(L, lambda (p,q),(r,s): cmp(p*s, q*r))

1 loops, best of 3: 25.1 usec per loop

[steve@sylar ~]$ python2.7 -m timeit -s L = [(1,2), (3,4), (0,5), 
(9,100), (3,7), (2,8)] -s from fractions import Fraction sorted(L, 
key=lambda t: Fraction(*t))

1000 loops, best of 3: 236 usec per loop


So for a short list, I get a factor of ten difference. For a longer 
list, I'd expect the key function to win out. Much to my astonishment, 
it doesn't -- I get similar results regardless of the size of L.


Size of L   key/cmp
==  =
6   9.4
600 13.9
6   7.0
600 6.7




--
Steven

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Glenn Linderman

On 3/12/2011 2:09 PM, Terry Reedy wrote:
I believe that if the integer field were padded with leading blanks as 
needed so that all are the same length, then no key would be needed. 


Did you mean that if the integer field were converted to string and 
padded with leading blanks...?


Otherwise I'm not sure how to pad an integer with leading blanks.

Also, what appears to be your revised data structure,   strval + ',' + 
'%5.5d' % intval , assumes the strval is fixed length, also.  Consider  
the following strval, intval pairs, using your syntax:


['a,997,1','a, 1000']

Nothing says the strval wouldn't contain data that look like your 
structure... and just because they were short, didn't mean they were 
fixed length.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Terry Reedy

On 3/12/2011 8:28 PM, Steven D'Aprano wrote:

Fredrik Johansson wrote:


Consider sorting a list of pairs representing fractions. This can be
done easily in Python 2.x with the comparison function lambda
(p,q),(r,s): cmp(p*s, q*r). In Python 2.6, this is about 40 times
faster than using fractions.Fraction as a key function.



[steve@sylar ~]$ python2.7 -m timeit -s L = [(1,2), (3,4), (0,5),
(9,100), (3,7), (2,8)] sorted(L, lambda (p,q),(r,s): cmp(p*s, q*r))
1 loops, best of 3: 25.1 usec per loop

[steve@sylar ~]$ python2.7 -m timeit -s L = [(1,2), (3,4), (0,5),
(9,100), (3,7), (2,8)] -s from fractions import Fraction sorted(L,
key=lambda t: Fraction(*t))
1000 loops, best of 3: 236 usec per loop


So for a short list, I get a factor of ten difference. For a longer
list, I'd expect the key function to win out. Much to my astonishment,
it doesn't -- I get similar results regardless of the size of L.

Size of L key/cmp
== =
6 9.4
600 13.9
6 7.0
600 6.7


Interesting. Comparing two Fractions must do the same thing as the cmp 
function, compare two products, but it does so with a *lot* more overhead:


571 def _richcmp(self, other, op):
572 Helper for comparison operators, for internal use
574 Implement comparison between a Rational instance and
575 either another Rational instance or a float `other`. If
576 `other` is not a Rational instance or a float, return
577 NotImplemented. `op` should be one of the six standard
578 comparison operators.
580 
581 # convert other to a Rational instance where reasonable.
582 if isinstance(other, numbers.Rational):
583 return op(self._numerator * other.denominator,
584   self._denominator * other.numerator)
585 if isinstance(other, float):
586 if math.isnan(other) or math.isinf(other):
587 return op(0.0, other)
588 else:
589 return op(self, self.from_float(other))
590 else:
591 return NotImplemented
592
593 def __lt__(a, b):
594 a  b
595 return a._richcmp(b, operator.lt)

For this example, and any with suitably restricted values, key=float 
would work and should beat the cmp version. But of course, someone will 
have a use case for which that will not work. (But then, one could do a 
near linear scan to properly re-sort slices with equal float keys.)


Hmm. As I remember, one rationale for deleting cmp is that nlogn slow 
cmp() calls are slower than n slow key() calls, but that only matters if 
the nlogn '' comparisions of stored keys are fast compared to cmp calls 
(as tends to be the case when the keys are builtin class instances).


But in this case, they are much slower. To be faster, one would need 
something like key=lambda p,q:p*(lcm//q), where lcm is the least 
common multiple of of all the q's (denominators). For the example above, 
lcm = 700. But for longer lists, it could be outrageously large.


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Martin v. Löwis

[steve@sylar ~]$ python2.7 -m timeit -s L = [(1,2), (3,4), (0,5),
(9,100), (3,7), (2,8)] sorted(L, lambda (p,q),(r,s): cmp(p*s, q*r))
1 loops, best of 3: 25.1 usec per loop

[steve@sylar ~]$ python2.7 -m timeit -s L = [(1,2), (3,4), (0,5),
(9,100), (3,7), (2,8)] -s from fractions import Fraction sorted(L,
key=lambda t: Fraction(*t))
1000 loops, best of 3: 236 usec per loop


So for a short list, I get a factor of ten difference. For a longer
list, I'd expect the key function to win out. Much to my astonishment,
it doesn't -- I get similar results regardless of the size of L.


This shouldn't be surprising. The cost is primarily in the comparisons,
not in the creation of the Fraction objects. Now, the number of
comparisons won't change whether you use a cmp function or key objects;
the algorithm will compare and switch the same objects in the same
order. So if a Fraction.__lt__ takes seven times as long as a cmp call,
this ratio is preserved even for very long lists.

A lot can be saved if the __lt__ would assume that the other object
is of the same kind:

class Fcomp(tuple):
def __lt__(self, other):
return self[0]*other[1]  self[1]*other[0]

python -m timeit -s L = [(1,2), (3,4), (0,5), (9,100), (3,7), (2,8)] 
-s from fcomp import Fcomp sorted(L, key=Fcomp)


Regards,
Martin

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Raymond Hettinger

On Mar 12, 2011, at 3:44 PM, Guido van Rossum wrote:

 I was just reminded that in Python 3, list.sort() and sorted() no
 longer support the cmp (comparator) function argument. The reason is
 that the key function argument is always better. But now I have a
 nagging doubt about this:
 
 I recently advised a Googler who was sorting a large dataset and
 running out of memory. 

. . .

 But in Python 3 this solution is no longer available. How bad is that?
 I'm not sure. But I'd like to at least get the issue out in the open.
 


Python3.2 should be substantially better in this regard.
It no longer wraps key objects around every input.  Instead, it
sorts two parallel arrays of pointers.   You can thank Daniel
Stutzbach (another Googler I believe) for this effort.


Raymond

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Terry Reedy

On 3/12/2011 8:47 PM, Glenn Linderman wrote:

On 3/12/2011 2:09 PM, Terry Reedy wrote:

I believe that if the integer field were padded with leading blanks as
needed so that all are the same length, then no key would be needed.


Did you mean that if the integer field were converted to string and
padded with leading blanks...?


Guido presented a use case of a list a strings, each of form '%s,%d', 
where the %s part is a 'word'. 'Integer field' refers to the part of 
each string after the comma.



Otherwise I'm not sure how to pad an integer with leading blanks.


The integers are already in string form. The *existing* key function his 
colleague used converted that part to an int as the second part of a 
tuple. I presume the integer field was separated by split(','), so the 
code was something like


def sikey(s):
  s,i = s.split(',')
  return s,int(i)

longlist.sort(key=sikey)

It does not matter if the splitting method is more complicated, because 
it is already part of the problem spec. I proposed instead


def sirep(s):
  s,i = s.split(',') # or whatever current key func does
  return '%s,%#s' % (s,i)
  # where appropriate value of # is known from application

longlist = map(sirep, longlist)
longlist.sort()

# or assuming that a simple split is correct

longlist = ['%s,%#s' % tuple(s.split(',')) for s in longlist]
longlist.sort()


Also, what appears to be your revised data structure, strval + ',' +
'%5.5d' % intval , assumes the strval is fixed length, also.


No it does not, and need not. ',' precedes all letters in ascii order. 
(Ok, I assumed that the 'word' field does not include any of 
!#$%'()*+. If that is not true, replace comma with space or even a 
control char such as '\a' which even precedes \t and \n.) Given the 
context of Google, I assumed that 'word' meant word, as in a web 
document, while the int might be a position or doc number (or both). The 
important point is that the separator cause all word-int pairs with the 
same word to string-sort before all word-int pairs with the same word + 
a suffix. My example intentionally tested that.



Consider the following strval, intval pairs, using your syntax:

['a,997, 1','a, 1000']

Nothing says the strval wouldn't contain data that look like your
structure...


The problem as presented. 'a,997' is not a word. In any case, as I said 
before, the method of correctly parsing the strings into two fields is 
already specified. I am only suggesting a change in how to proceed 
thereafter.


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Glenn Linderman

On 3/12/2011 7:21 PM, Terry Reedy wrote:
(Ok, I assumed that the 'word' field does not include any of 
!#$%'()*+. If that is not true, replace comma with space or even a 
control char such as '\a' which even precedes \t and \n.)


OK, I agree the above was your worst assumption, although you need to 
add , to your list also, because that allows for the data puns.


You also rewrote Guido's text from shortstring to word and assumed 
it had certain content semantics, but since only integer is after the 
,, rsplit would work to separate the fields even if shortstring 
contains ,.


And the choice of delimiter really determines whether data puns can 
exist.  If and only if you know that there is a character that is lower 
in sort order than any of the characters in the sort strings, can you 
cheat and put a variable length string into a sort key field, by 
terminating it with such a character.  The safest such character is \0, 
unless you are coding in C, then \a as you now suggest, but only if you 
can be 100% sure it is not found in the data.  If you cannot guarantee 
the data doesn't contain them, there will be the possibility of data 
puns among variable length strings, and the algorithms will sort wrong 
in pathological cases.


I wouldn't have called you on this, except that it really is important 
not to give people the idea that you can blithely use a variable length 
string anywhere except at the tail of a multi-field sort string.  In 
general, you can't.  I've long since lost track of the number of times 
I've helped people understand the fix to programs that tried that.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 regret about deleting list.sort(cmp=...)

2011-03-12 Thread Terry Reedy

On 3/12/2011 10:52 PM, Glenn Linderman wrote:

On 3/12/2011 7:21 PM, Terry Reedy wrote:



The safest such character is \0,\


Works fine in Python.


unless you are coding in C,


Then \01 is next best.


I wouldn't have called you on this, except that it really is important
not to give people the idea that you can blithely use a variable length
string anywhere except at the tail of a multi-field sort string.


Sorry, my initial brief comment was for people on this list who I 
assumed understood the issue.



general, you can't. I've long since lost track of the number of times
I've helped people understand the fix to programs that tried that.


Thanks for explaining. I also get fussy about things I have explained 
too many time ;-).


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com