Re: removing duplication from a huge list.

Adam Olsen Tue, 03 Mar 2009 13:10:45 -0800

On Feb 27, 9:55 am, Falcolas <garri...@gmail.com> wrote:
> If order did matter, and the list itself couldn't be stored in memory,
> I would personally do some sort of hash of each item (or something as
> simple as first 5 bytes, last 5 bytes and length), keeping a reference
> to which item the hash belongs, sort and identify duplicates in the
> hash, and using the reference check to see if the actual items in
> question match as well.
>
> Pretty brutish and slow, but it's the first algorithm which comes to
> mind. Of course, I'm assuming that the list items are long enough to
> warrant using a hash and not the values themselves.


Might as well move all the duplication checking to sqlite.

Although it seems tempting to stick a layer in front, you will always
require either a full comparison or a full update, so there's no
potential for a fast path.
--
http://mail.python.org/mailman/listinfo/python-list

Re: removing duplication from a huge list.

Reply via email to