On 2022-04-07 16:16, Antoon Pardon wrote:
On 7/04/2022 at 16:08, Joel Goldstick wrote:
On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon <antoon.par...@vub.be> wrote:
I am working with a list of data from which I have to weed out duplicates.
At the moment I keep for each entry a container with the other entries
that are still possible duplicates.

The problem is that sometimes that is all the rest. I thought to use a range
object for these cases. Unfortunately, I sometimes want to sort things,
and a range object is not comparable with a list or a tuple.

So I have a list of items where each item is itself a list or range object.
I of course could sort this by using list as a key function but that
would defeat the purpose of using range objects for these cases.

So what would be a relatively easy way to get the same result without wasting
too much memory on entries that haven't had any weeding done on them?

--
Antoon Pardon.
--
https://mail.python.org/mailman/listinfo/python-list
I'm not sure I understand what you are trying to do, but if your data
has no order, you can use a set to remove the duplicates.

Sorry I wasn't clear. The data contains information about persons. But not
all records need to be complete. So a person can occur multiple times in
the list, while the records are all different because they are missing
different bits.

So all records with the same firstname can be duplicates. But if I have
a record in which the firstname is missing, it can at that point be
a duplicate of all other records.

This is how I'd approach it:

from collections import defaultdict

# Make a list of groups, where each group is a list of potential duplicates.
# Initially, all of the records are potential duplicates of each other.
records = [list_of_records]

# Split the groups into subgroups according to the first name.
new_records = []

for group in records:
    subgroups = defaultdict(list)

    for record in group:
        # .get() treats an absent key the same as an explicit None.
        subgroups[record.get('first_name')].append(record)

    # Records without a first name could belong to any of the subgroups.
    missing = subgroups.pop(None, [])

    if subgroups:
        # Add the nameless records to each subgroup exactly once.
        for subgroup in subgroups.values():
            subgroup.extend(missing)

        new_records.extend(subgroups.values())
    elif missing:
        # No record in this group has a first name, so keep the group whole.
        new_records.append(missing)

records = new_records

# Now repeat for the last name, etc.
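
To make the "repeat for the last name, etc." step concrete, here is a minimal sketch that wraps the split in a function and applies it field by field. The function name split_groups, the 'last_name' key and the sample records are assumptions made up for illustration. It also assumes a missing value is stored as None or left out of the record entirely.

```python
from collections import defaultdict


def split_groups(groups, field):
    """Split each group of possible duplicates by the value of `field`.

    Records whose `field` is None (or absent) go into every subgroup once.
    """
    new_groups = []

    for group in groups:
        subgroups = defaultdict(list)

        for record in group:
            subgroups[record.get(field)].append(record)

        # Records without this field could belong to any of the subgroups.
        missing = subgroups.pop(None, [])

        if subgroups:
            for subgroup in subgroups.values():
                subgroup.extend(missing)
            new_groups.extend(subgroups.values())
        elif missing:
            # Nothing to split on, so the group stays together.
            new_groups.append(missing)

    return new_groups


# Hypothetical sample data, invented for illustration only.
list_of_records = [
    {'first_name': 'Ann', 'last_name': 'Lee'},
    {'first_name': 'Ann', 'last_name': None},
    {'first_name': None, 'last_name': 'Lee'},
    {'first_name': 'Bob', 'last_name': 'Ray'},
]

groups = [list_of_records]
for field in ('first_name', 'last_name'):
    groups = split_groups(groups, field)

# A group holding a single record has no possible duplicates left.
groups = [group for group in groups if len(group) > 1]
```

The groups hold references to the record dicts, not copies, so a record with a missing field that lands in several groups costs one extra list slot per group.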