Re: ANN: blist 1.2.0

2010-07-23 Thread Daniel Stutzbach
On Wed, Jul 21, 2010 at 9:47 AM, Daniel Stutzbach <
dan...@stutzbachenterprises.com> wrote:

> What's new?
> ---
>
> - blist.sort() is now *substantially* faster than list.sort() when using
> int or float keys (O(n) vs. O(n log n))
> - The sortedset, sortedlist, and sorteddict types have been revamped for
> better compatibility with the standard library's types.
> - Comprehensive reference documentation is now available at
> http://stutzbachenterprises.com/blist/
> - Numerous other speed improvements
>

Quick addendum:

I just uploaded blist 1.2.1 which fixes a compilation error on BSD-derived
systems (notably MacOS X).
--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ANN: blist 1.2.0

2010-07-22 Thread Daniel Stutzbach
On Thu, Jul 22, 2010 at 6:52 PM, Terry Reedy  wrote:

> Another sort of issue will be code maintainability. Two algorithms is
> potentially more problems than one. To me, that partly depends on how well
> factored the current code is.
>

Two algorithms is more code than one algorithm.  No way around that. :-)

Radix sort is a relatively simple sorting algorithm, so the extra complexity
is not too bad.  Once I have a patch ready, others can decide if the
additional complexity is worth the performance boost.


> It would be best is rsort were a switch replacement for timsort after all
> preparations (such as decorate) were done.
>

Yes, that's pretty much what I have in mind.  In pseudo-code:

sort:
decorate()
if all_correct_type and n > 40:
radix_sort()
else
timsort()
undecorate()


> And that is about all I can think of.


Thank you for the thoughts.  I appreciate them!
--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ANN: blist 1.2.0

2010-07-22 Thread Terry Reedy

On 7/22/2010 7:22 PM, Daniel Stutzbach wrote:

On Thu, Jul 22, 2010 at 5:25 PM, Terry Reedy 


That's a good point.  It's tempting to add an undocumented parameter to
blist.sort that selects the sorting algorithm to use, to make it make it
easier to test multiple algorithms.  There are probably several
different ways to achieve a similar effect.  Do you mind if we table
that discussion until I actually have a patch?


Definitely. That will not be my decision. Since you seem serious about 
this, I decided to give you a preview of the questions to expect, and 
suggest some to answer in your initial filing.


Another sort of issue will be code maintainability. Two algorithms is 
potentially more problems than one. To me, that partly depends on how 
well factored the current code is. It would be best is rsort were a 
switch replacement for timsort after all preparations (such as decorate) 
were done. I will leave that for you to address when you file.


And that is about all I can think of.

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: ANN: blist 1.2.0

2010-07-22 Thread Daniel Stutzbach
On Thu, Jul 22, 2010 at 5:25 PM, Terry Reedy  wrote:

> Looks good so far. I would like to see that repeated all the way down to
> range(10) to make sure people doing millions of small sorts were not getting
> screwed.
>

I only use the radix sort for n > 40. :-)


> Have you run a patched version against test_sort.py? I believe it mostly
> tests lists of small ints, so radix methods would mostly be switched in.
>

For testing purposes, I maintain a private fork of Python where I've
replaced the built-in list with blist.  I then run Python's test suite as a
way of testing blist.  So, yes, all of the tests have been run.  :-)
 However, since I only use radix for n > 40, many of the tests have not
actually tested the radix sort.


> If it were added and the switching were internal, new test cases would be
> needed to test test timsort.


That's a good point.  It's tempting to add an undocumented parameter to
blist.sort that selects the sorting algorithm to use, to make it make it
easier to test multiple algorithms.  There are probably several different
ways to achieve a similar effect.  Do you mind if we table that discussion
until I actually have a patch?
--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ANN: blist 1.2.0

2010-07-22 Thread Mark Lawrence

On 22/07/2010 23:25, Terry Reedy wrote:

On 7/21/2010 6:56 PM, Daniel Stutzbach wrote:

On Wed, Jul 21, 2010 at 5:27 PM, Terry Reedy mailto:tjre...@udel.edu>> wrote:

These tests use random numbers with a constant, relatively high
density of 25%, which is favorable to radix sort. What if you do the
same test with a constant range of, say, 10 (1 billion) or
even a trillion or quadrillion. Or how about sorting a thousand
128bit ip6 addresses? Which wins for that?



blist switches to radix sort only if the keys contain only floats or
only integers that fit into a C long. If the integers get too big, it
reverts to timsort.

When using a sort key, the decorate-sort-undecorate pattern requires
touching every object once before the sort. The check for a consistent
type is done inexpensively during the decorate phase.


And it is pretty cheap even when there is no decorate phase.


Here's an example result with a density of only 1%, where the radix sort
is around 4 times as fast:

otto:~/src/blist$ python3.1 -m timeit -s 'import random' -s 'x =
[random.randrange(1*100) for i in range(1)]' 'y = list(x)'
'y.sort(key=float)'
100 loops, best of 3: 12.4 msec per loop

otto:~/src/blist$ python3.1 -m timeit -s 'from blist import blist' -s
'import random' -s 'x = [random.randrange(1*100) for i in
range(1)]' 'y = blist(x)' 'y.sort(key=float)'
100 loops, best of 3: 3.4 msec per loop



And a density of 0.01%:

otto:~/src/blist$ python3.1 -m timeit -s 'import random' -s 'x =
[random.randrange(1*1) for i in range(1)]' 'y = list(x)'
'y.sort(key=float)'
100 loops, best of 3: 12 msec per loop

otto:~/src/blist$ python3.1 -m timeit -s 'from blist import blist' -s
'import random' -s 'x = [random.randrange(1*1) for i in
range(1)]' 'y = blist(x)' 'y.sort(key=float)'
100 loops, best of 3: 3.52 msec per loop


Looks good so far. I would like to see that repeated all the way down to
range(10) to make sure people doing millions of small sorts were not
getting screwed.


list.sort is (near) linear for lists that are mostly ordered. I
think Tim's writeup about this in in the source. For instance, if
one extends a sorted with 1% random additions and resorts, list.sort
will skip the sorted part, sort the additions, and merge them back
in. Radix sort ignores pre-existing order. So either a lot of
comparitive profiling would be needed for auto-selection of sort
method, or it should be user selectable.
I've tested exactly that scenario. For already-ordered lists, radix
sort and timsort tie.


Tim tested about 7 list structure scenarios with a range of lengths. If
you post a patch, I would request that you do the same.

Have you run a patched version against test_sort.py? I believe it mostly
tests lists of small ints, so radix methods would mostly be switched in.

If it were added and the switching were internal, new test cases would
be needed to test test timsort.


Does your radix sort meet the stability guarantee of list.sort?

Yes.


Great. There is, of course, a test for that in the suite.



Can I please book front row tickets for the shoot out between the 
reigning heavyweight champ Timbot and the challenger, the up and coming 
Danbot? :)


Kindest regards.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: ANN: blist 1.2.0

2010-07-22 Thread Terry Reedy

On 7/21/2010 6:56 PM, Daniel Stutzbach wrote:

On Wed, Jul 21, 2010 at 5:27 PM, Terry Reedy mailto:tjre...@udel.edu>> wrote:

These tests use random numbers with a constant, relatively high
density of 25%, which is favorable to radix sort. What if you do the
same test with a constant range of, say, 10 (1 billion) or
even a trillion or quadrillion. Or how about sorting a thousand
128bit ip6 addresses? Which wins for that?



blist switches to radix sort only if the keys contain only floats or
only integers that fit into a C long.  If the integers get too big, it
reverts to timsort.

When using a sort key, the decorate-sort-undecorate pattern requires
touching every object once before the sort.  The check for a consistent
type is done inexpensively during the decorate phase.


And it is pretty cheap even when there is no decorate phase.


Here's an example result with a density of only 1%, where the radix sort
is around 4 times as fast:

otto:~/src/blist$ python3.1 -m timeit -s 'import random' -s 'x =
[random.randrange(1*100) for i in range(1)]' 'y = list(x)'
'y.sort(key=float)'
100 loops, best of 3: 12.4 msec per loop

otto:~/src/blist$ python3.1 -m timeit -s 'from blist import blist' -s
'import random' -s 'x = [random.randrange(1*100) for i in
range(1)]' 'y = blist(x)' 'y.sort(key=float)'
100 loops, best of 3: 3.4 msec per loop



And a density of 0.01%:

otto:~/src/blist$ python3.1 -m timeit -s 'import random' -s 'x =
[random.randrange(1*1) for i in range(1)]' 'y = list(x)'
'y.sort(key=float)'
100 loops, best of 3: 12 msec per loop

otto:~/src/blist$ python3.1 -m timeit -s 'from blist import blist' -s
'import random' -s 'x = [random.randrange(1*1) for i in
range(1)]' 'y = blist(x)' 'y.sort(key=float)'
100 loops, best of 3: 3.52 msec per loop


Looks good so far. I would like to see that repeated all the way down to 
range(10) to make sure people doing millions of small sorts were not 
getting screwed.



list.sort is (near) linear for lists that are mostly ordered. I
think Tim's writeup about this in in the source. For instance, if
one extends a sorted with 1% random additions and resorts, list.sort
will skip the sorted part, sort the additions, and merge them back
in. Radix sort ignores pre-existing order. So either a lot of
comparitive profiling would be needed for auto-selection of sort
method, or it should be user selectable.
I've tested exactly that scenario.  For already-ordered lists, radix
sort and timsort tie.


Tim tested about 7 list structure scenarios with a range of lengths. If 
you post a patch, I would request that you do the same.


Have you run a patched version against test_sort.py? I believe it mostly 
tests lists of small ints, so radix methods would mostly be switched in.


If it were added and the switching were internal, new test cases would 
be needed to test test timsort.



Does your radix sort meet the stability guarantee of list.sort?

Yes.


Great. There is, of course, a test for that in the suite.

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: ANN: blist 1.2.0

2010-07-21 Thread Daniel Stutzbach
On Wed, Jul 21, 2010 at 5:27 PM, Terry Reedy  wrote:

> These tests use random numbers with a constant, relatively high density of
> 25%, which is favorable to radix sort. What if you do the same test with a
> constant range of, say, 10 (1 billion) or even a trillion or
> quadrillion. Or how about sorting a thousand 128bit ip6 addresses? Which
> wins for that?
>

blist switches to radix sort only if the keys contain only floats or only
integers that fit into a C long.  If the integers get too big, it reverts to
timsort.

When using a sort key, the decorate-sort-undecorate pattern requires
touching every object once before the sort.  The check for a consistent type
is done inexpensively during the decorate phase.

Here's an example result with a density of only 1%, where the radix sort is
around 4 times as fast:


otto:~/src/blist$ python3.1 -m timeit -s 'import random' -s 'x =
[random.randrange(1*100) for i in range(1)]' 'y = list(x)'
'y.sort(key=float)'
100 loops, best of 3: 12.4 msec per loop

otto:~/src/blist$ python3.1 -m timeit -s 'from blist import blist' -s
'import random' -s 'x = [random.randrange(1*100) for i in range(1)]'
'y = blist(x)' 'y.sort(key=float)'
100 loops, best of 3: 3.4 msec per loop


And a density of 0.01%:


otto:~/src/blist$ python3.1 -m timeit -s 'import random' -s 'x =
[random.randrange(1*1) for i in range(1)]' 'y = list(x)'
'y.sort(key=float)'
100 loops, best of 3: 12 msec per loop

otto:~/src/blist$ python3.1 -m timeit -s 'from blist import blist' -s
'import random' -s 'x = [random.randrange(1*1) for i in
range(1)]' 'y = blist(x)' 'y.sort(key=float)'
100 loops, best of 3: 3.52 msec per loop


> list.sort is (near) linear for lists that are mostly ordered. I think Tim's
> writeup about this in in the source. For instance, if one extends a sorted
> with 1% random additions and resorts, list.sort will skip the sorted part,
> sort the additions, and merge them back in. Radix sort ignores pre-existing
> order. So either a lot of comparitive profiling would be needed for
> auto-selection of sort method, or it should be user selectable.
>

I've tested exactly that scenario.  For already-ordered lists, radix sort
and timsort tie.


> Does your radix sort meet the stability guarantee of list.sort?
>

Yes.
--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ANN: blist 1.2.0

2010-07-21 Thread Terry Reedy

On 7/21/2010 12:15 PM, Daniel Stutzbach wrote:


Here's a performance comparison of sorting with blist versus 3.1's list:
http://stutzbachenterprises.com/performance-blist/sort-random-list


Question related to possible addition of radix sort to lists:

These tests use random numbers with a constant, relatively high density 
of 25%, which is favorable to radix sort. What if you do the same test 
with a constant range of, say, 10 (1 billion) or even a trillion 
or quadrillion. Or how about sorting a thousand 128bit ip6 addresses? 
Which wins for that?


list.sort is (near) linear for lists that are mostly ordered. I think 
Tim's writeup about this in in the source. For instance, if one extends 
a sorted with 1% random additions and resorts, list.sort will skip the 
sorted part, sort the additions, and merge them back in. Radix sort 
ignores pre-existing order. So either a lot of comparitive profiling 
would be needed for auto-selection of sort method, or it should be user 
selectable.


Does your radix sort meet the stability guarantee of list.sort?
If not, a separate .rsort method would be needed anyway.

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: ANN: blist 1.2.0

2010-07-21 Thread Daniel Stutzbach
On Wed, Jul 21, 2010 at 10:32 AM, Antoine Pitrou wrote:

> Are you using some kind of radix sort?
> Could it be contributed back into the standard list type?


Yes and yes.  I have a few mostly-complete patches on the tracker that I
need to polish off, as well as several easy-to-fix bugs that I need to take
care of.  After that I plan to work on porting my sort optimizations back to
the standard list type.

Here's a performance comparison of sorting with blist versus 3.1's list:
http://stutzbachenterprises.com/performance-blist/sort-random-list
--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ANN: blist 1.2.0

2010-07-21 Thread Antoine Pitrou
On Wed, 21 Jul 2010 09:47:08 -0500
Daniel Stutzbach  wrote:
> 
> What's new?
> ---
> 
> - blist.sort() is now *substantially* faster than list.sort() when using int
> or float keys (O(n) vs. O(n log n))

Are you using some kind of radix sort?
Could it be contributed back into the standard list type?

Regards

Antoine.


-- 
http://mail.python.org/mailman/listinfo/python-list


ANN: blist 1.2.0

2010-07-21 Thread Daniel Stutzbach
The blist package contains several container types to supplement those
built-in to Python.  These types closely mirror the built-in types' methods,
so the learning curve is close to zero.  The package includes:

- blist: a list type with O(log n) insertions, deletes, and slices and other
performance enhancements
- btuple: a tuple type with O(log) slices
- sorteddict: a dict type that efficiently keeps keys in sorted order
- sortedlist: a list type that efficiently keeps the items sorted order
- sortedset: an indexable set type that efficiently keeps the items sorted
order
- weaksortedlist, weaksortedset: weak reference versions of sortedlist and
sortedset

What's new?
---

- blist.sort() is now *substantially* faster than list.sort() when using int
or float keys (O(n) vs. O(n log n))
- The sortedset, sortedlist, and sorteddict types have been revamped for
better compatibility with the standard library's types.
- Comprehensive reference documentation is now available at
http://stutzbachenterprises.com/blist/
- Numerous other speed improvements

Links
-

- blist vs. list performance comparison:
http://stutzbachenterprises.com/performance-blist
- Documentation: http://stutzbachenterprises.com/blist/
- Download: http://pypi.python.org/pypi/blist/
- Source repository: http://github.com/DanielStutzbach/blist
- Issue tracker: http://github.com/DanielStutzbach/blist/issues
--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC 
-- 
http://mail.python.org/mailman/listinfo/python-list