On Feb 27, 9:55 am, Falcolas <garri...@gmail.com> wrote: > If order did matter, and the list itself couldn't be stored in memory, > I would personally do some sort of hash of each item (or something as > simple as first 5 bytes, last 5 bytes and length), keeping a reference > to which item the hash belongs, sort and identify duplicates in the > hash, and using the reference check to see if the actual items in > question match as well. > > Pretty brutish and slow, but it's the first algorithm which comes to > mind. Of course, I'm assuming that the list items are long enough to > warrant using a hash and not the values themselves.
Might as well move all the duplication checking to sqlite. Although it seems tempting to stick a layer in front, you will always require either a full comparison or a full update, so there's no potential for a fast path. -- http://mail.python.org/mailman/listinfo/python-list