On 3/13/2011 2:05 PM, Daniel Stutzbach wrote:
On Sat, Mar 12, 2011 at 3:44 PM, Guido van Rossum <gu...@python.org
<mailto:gu...@python.org>> wrote:

    I recently advised a Googler who was sorting a large dataset and
    running out of memory. My analysis of the situation was that he was
    sorting a huge list of short lines of the form "shortstring,integer"
    with a key function that returned a tuple of the form ("shortstring",
    integer).


As Raymond pointed out, a change I made for 3.2 significantly shrinks
the memory footprint of sorting with a key (although it's still more
memory-intensive than sorting with cmp).

He could reduce the memory footprint further by sorting in two passes
instead of using a tuple, leveraging the fact that Python guarantees a
stable sort.  In 3.2 or later, this technique will require roughly twice
as much memory as just storing the list:

biglist.sort(key=lambda s: int(s.split(',')[1]))  # Sort by the integer
biglist.sort(key=lambda s: s.split(',')[0])  # Sort by the shortstring

I think the use cases are pretty narrow where there's plenty of memory
for storing the list but not enough to store two copies.

Here are two variations on the idea of two pass sorting, both of which only sort ints for duplicate shortstrings and neither of which use key=.

1. If duplication of shortstrings is (known to be) sparse, sort the lines as they are, then scan to detect runs (slices) of duplicate shortstrings and sort and replace those. (If sort had start and stop index keywords, this could be done in place.)

2. If duplication of shortstrings is (known to be) heavy, read the dataset into a shortstring-list_of_ints dict rathar than a list. Sort the keys in a separate (much shorted) list and sort each value (list of ints) separately. For some post-sort usages, this would be more useful than a flat sorted list.

A third idea is to presort the dataset into chunks (a-m, n-z) or however appropriate, perhaps as it is gathered; sort each separately; and then concatenate. This will handle cases where even the flat list is too big for memory.

--
Terry Jan Reedy

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to