On 8/04/2022 at 16:28, duncan smith wrote:
On 08/04/2022 08:21, Antoon Pardon wrote:

Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is an (immutable) list of all these indexes without actually containing
them. A range object almost works; the only problem is that it is not
comparable with a list.
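
One possible way to get such an object is a thin wrapper around range that adds equality with lists. This is only a sketch: the class name AllIndexes is made up for illustration, and it assumes "comparable" means == and != rather than ordering.

```python
from collections.abc import Sequence


class AllIndexes(Sequence):
    """Acts like an immutable list of 0..n-1 without storing the elements.

    Hypothetical helper, not part of any library.
    """

    def __init__(self, n):
        self._range = range(n)

    def __len__(self):
        return len(self._range)

    def __getitem__(self, index):
        # Delegates to range; a slice therefore yields a range, not a list.
        return self._range[index]

    def __contains__(self, value):
        # range membership is O(1) for ints, unlike the Sequence default.
        return value in self._range

    def __eq__(self, other):
        if isinstance(other, AllIndexes):
            return self._range == other._range
        if isinstance(other, (list, tuple, range)):
            return len(other) == len(self) and all(
                a == b for a, b in zip(self._range, other)
            )
        return NotImplemented


everything = AllIndexes(5)
print(everything == [0, 1, 2, 3, 4])   # True
print([0, 1, 2, 3, 4] == everything)   # True, via the reflected comparison
print(everything == [0, 1, 2])         # False
```

Because the class defines __eq__ without __hash__, instances are unhashable, which matches how lists behave. Comparing against a list still walks the list, so the saving is in memory, not in comparison time.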


Is there any reason why you can't use ints? Just set the relevant bits.

Well, my first thought is that a bitset makes it less obvious to calculate
the size of the set or to iterate over its elements. But it is an idea
worth exploring.
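
For what it is worth, a rough sketch of what the int-as-bitset idea could look like, including the size and iteration that seemed less obvious. The function names are invented for this example; bit i set means index i is in the bucket.

```python
def bitset_from_indexes(indexes):
    """Build an int with bit i set for every index i."""
    bits = 0
    for i in indexes:
        bits |= 1 << i
    return bits


def bitset_all(n):
    """Bitset containing every index 0..n-1 (the 'unusable field' bucket)."""
    return (1 << n) - 1


def bitset_size(bits):
    """Number of elements; int.bit_count() does the same on Python 3.10+."""
    return bin(bits).count("1")


def bitset_iter(bits):
    """Yield the indexes of the set bits in increasing order."""
    while bits:
        low = bits & -bits            # isolate the lowest set bit
        yield low.bit_length() - 1    # position of that bit
        bits ^= low                   # clear it


bucket = bitset_from_indexes([0, 3, 5])
print(bitset_size(bucket))                  # 3
print(list(bitset_iter(bucket)))            # [0, 3, 5]
print(list(bitset_iter(bucket & bitset_all(4))))  # [0, 3]
```

Intersection and union of buckets become & and |. The "all records" bitset still needs one bit per record, so it is not free, only far smaller than a list of ints. Each step of bitset_iter builds a new int, so iterating over a very large bucket this way may be slow.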

--
Antoon.
--
https://mail.python.org/mailman/listinfo/python-list