Re: removing duplication from a huge list.

Falcolas Fri, 27 Feb 2009 12:06:32 -0800

On Feb 27, 10:07 am, Steve Holden <st...@holdenweb.com> wrote:
> Assuming no duplicates, how does this help? You still have to verify
> collisions.


Absolutely. But a decent hashing function (particularly since you know
quite a bit about the data beforehand) will have very few collisions
(theoretically no collisions, again since we know the data
beforehand).

> The problem is you can't guarantee any hash value will be unique, so
> ultimately you have to check whether list entries hashing to the same
> value are or are not equal. Which would mean either having the whole
> list in memory or having access to it via some other random access method.

Again, absolutely. But as Tim pointed out, with a decent hash the
number of superfluous checks should be minimal, so your checks will
mostly occur on valid duplicate values.

It's worth noting that I picked this algorithm with the assumption
that any storage to physical memory (such as required by using a set)
would be out of the question.

~G
--
http://mail.python.org/mailman/listinfo/python-list

Re: removing duplication from a huge list.

Reply via email to