On 3/13/2011 2:05 PM, Daniel Stutzbach wrote:
On Sat, Mar 12, 2011 at 3:44 PM, Guido van Rossum <gu...@python.org
<mailto:gu...@python.org>> wrote:
I recently advised a Googler who was sorting a large dataset and
running out of memory. My analysis of the situation was that he was
sorting a huge list of short lines of the form "shortstring,integer"
with a key function that returned a tuple of the form ("shortstring",
integer).
As Raymond pointed out, a change I made for 3.2 significantly shrinks
the memory footprint of sorting with a key (although it's still more
memory-intensive than sorting with cmp).
He could reduce the memory footprint further by sorting in two passes
instead of using a tuple, leveraging the fact that Python guarantees a
stable sort. In 3.2 or later, this technique will require roughly twice
as much memory as just storing the list:
biglist.sort(key=lambda s: int(s.split(',')[1])) # Sort by the integer
biglist.sort(key=lambda s: s.split(',')[0]) # Sort by the shortstring
I think the use cases are pretty narrow where there's plenty of memory
for storing the list but not enough to store two copies.
Here are two variations on the idea of two pass sorting, both of which
only sort ints for duplicate shortstrings and neither of which use key=.
1. If duplication of shortstrings is (known to be) sparse, sort the
lines as they are, then scan to detect runs (slices) of duplicate
shortstrings and sort and replace those. (If sort had start and stop
index keywords, this could be done in place.)
2. If duplication of shortstrings is (known to be) heavy, read the
dataset into a shortstring-list_of_ints dict rathar than a list. Sort
the keys in a separate (much shorted) list and sort each value (list of
ints) separately. For some post-sort usages, this would be more useful
than a flat sorted list.
A third idea is to presort the dataset into chunks (a-m, n-z) or however
appropriate, perhaps as it is gathered; sort each separately; and then
concatenate. This will handle cases where even the flat list is too big
for memory.
--
Terry Jan Reedy
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com