Re: Comparing sequences with range objects

2022-04-12 Thread Antoon Pardon




Op 11/04/2022 om 02:31 schreef Dan Stromberg:


It sounds a little like you're looking for interval arithmetic.

Maybe https://pypi.org/project/python-intervals/1.5.3/ ?


Not completely but it suggested an idea to explore.

--
Antoon.
--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-12 Thread Antoon Pardon



Op 11/04/2022 om 02:01 schreef duncan smith:

On 10/04/2022 21:20, Antoon Pardon wrote:



Op 9/04/2022 om 02:01 schreef duncan smith:

On 08/04/2022 22:08, Antoon Pardon wrote:


Well my first thought is that a bitset makes it less obvious to 
calulate

the size of the set or to iterate over its elements. But it is an idea
worth exploring.





def popcount(n):
    """
    Returns the number of set bits in n
    """
    cnt = 0
    while n:
    n &= n - 1
    cnt += 1
    return cnt

and not tested,

def iterinds(n):
    """
    Returns a generator of the indices of the set bits of n
    """
    i = 0
    while n:
    if n & 1:
    yield i
    n = n >> 1
    i += 1

Sure but these seem rather naive implementation with a time 
complexity of
O(n) where n is the maximum number of possible elements. Using these 
would

turn my O(n) algorithm in a O(n^2) algorithm.



I thought your main concern was memory. Of course, dependent on 
various factors, you might be able to do much better than the above. 
But I don't know what your O(n) algorithm is, how using a bitset would 
make it O(n^2), or if the O(n^2) algorithm would actually be slower 
for typical n. The overall thing sounds broadly like some of the 
blocking and clustering methods I've come across in record linkage.


Using bitsets would make change my algorithm from O(n) into O(n^2) because
the bitset operations are O(n) instead of O(1). If I have a 2 people
a bitset will take a vector of about 300, 64bit words. With 4 people
the bitset will take a vector of about 600. So doubling the population
will also double the time for the bitset operations, meaning doubling
the population will increase execution time four times.

--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-10 Thread Dan Stromberg
Two more things:
1) There are also modules that do interval arithmetic with reals, not just
integers.  EG: https://pyinterval.readthedocs.io/en/latest/
2) If you don't mind a small amount of inaccuracy you can often get things
down to less than one bit per element for presence/absence, using a bloom
filter.  EG: https://pypi.org/project/drs-bloom-filter/

On Sun, Apr 10, 2022 at 5:31 PM Dan Stromberg  wrote:

>
> It sounds a little like you're looking for interval arithmetic.
>
> Maybe https://pypi.org/project/python-intervals/1.5.3/ ?
>
> On Thu, Apr 7, 2022 at 4:19 AM Antoon Pardon  wrote:
>
>> I am working with a list of data from which I have to weed out duplicates.
>> At the moment I keep for each entry a container with the other entries
>> that are still possible duplicates.
>>
>> The problem is sometimes that is all the rest. I thought to use a range
>> object for these cases. Unfortunatly I sometimes want to sort things
>> and a range object is not comparable with a list or a tuple.
>>
>> So I have a list of items where each item is itself a list or range
>> object.
>> I of course could sort this by using list as a key function but that
>> would defeat the purpose of using range objects for these cases.
>>
>> So what would be a relatively easy way to get the same result without
>> wasting
>> too much memory on entries that haven't any weeding done on them.
>>
>> --
>> Antoon Pardon.
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>>
>
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-10 Thread 2QdxY4RzWzUUiLuE
On 2022-04-10 at 22:20:33 +0200,
Antoon Pardon  wrote:

> 
> 
> Op 9/04/2022 om 02:01 schreef duncan smith:
> > On 08/04/2022 22:08, Antoon Pardon wrote:
> > > 
> > > Well my first thought is that a bitset makes it less obvious to calulate
> > > the size of the set or to iterate over its elements. But it is an idea
> > > worth exploring.
> > > 
> > 
> > 
> > 
> > def popcount(n):
> >     """
> >     Returns the number of set bits in n
> >     """
> >     cnt = 0
> >     while n:
> >     n &= n - 1
> >     cnt += 1
> >     return cnt
> > 
> > and not tested,
> > 
> > def iterinds(n):
> >     """
> >     Returns a generator of the indices of the set bits of n
> >     """
> >     i = 0
> >     while n:
> >     if n & 1:
> >     yield i
> >     n = n >> 1
> >     i += 1
> > 
> Sure but these seem rather naive implementation with a time complexity of
> O(n) where n is the maximum number of possible elements. Using these would
> turn my O(n) algorithm in a O(n^2) algorithm.

O(n) where n is the expected number of elements.  The loops iterate once
for each bit actually contained in the set, which is usually [much] less
than the size of the universe.  If you have lots and lots of elements in
your sets, or your universe is large and n is a long integer, then these
may not be as efficient as other methods.  You know your data better
than we do.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-10 Thread Dan Stromberg
It sounds a little like you're looking for interval arithmetic.

Maybe https://pypi.org/project/python-intervals/1.5.3/ ?

On Thu, Apr 7, 2022 at 4:19 AM Antoon Pardon  wrote:

> I am working with a list of data from which I have to weed out duplicates.
> At the moment I keep for each entry a container with the other entries
> that are still possible duplicates.
>
> The problem is sometimes that is all the rest. I thought to use a range
> object for these cases. Unfortunatly I sometimes want to sort things
> and a range object is not comparable with a list or a tuple.
>
> So I have a list of items where each item is itself a list or range object.
> I of course could sort this by using list as a key function but that
> would defeat the purpose of using range objects for these cases.
>
> So what would be a relatively easy way to get the same result without
> wasting
> too much memory on entries that haven't any weeding done on them.
>
> --
> Antoon Pardon.
> --
> https://mail.python.org/mailman/listinfo/python-list
>
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-10 Thread duncan smith

On 10/04/2022 21:20, Antoon Pardon wrote:



Op 9/04/2022 om 02:01 schreef duncan smith:

On 08/04/2022 22:08, Antoon Pardon wrote:


Well my first thought is that a bitset makes it less obvious to calulate
the size of the set or to iterate over its elements. But it is an idea
worth exploring.





def popcount(n):
    """
    Returns the number of set bits in n
    """
    cnt = 0
    while n:
    n &= n - 1
    cnt += 1
    return cnt

and not tested,

def iterinds(n):
    """
    Returns a generator of the indices of the set bits of n
    """
    i = 0
    while n:
    if n & 1:
    yield i
    n = n >> 1
    i += 1


Sure but these seem rather naive implementation with a time complexity of
O(n) where n is the maximum number of possible elements. Using these would
turn my O(n) algorithm in a O(n^2) algorithm.



I thought your main concern was memory. Of course, dependent on various 
factors, you might be able to do much better than the above. But I don't 
know what your O(n) algorithm is, how using a bitset would make it 
O(n^2), or if the O(n^2) algorithm would actually be slower for typical 
n. The overall thing sounds broadly like some of the blocking and 
clustering methods I've come across in record linkage.


Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-10 Thread Antoon Pardon



Op 9/04/2022 om 02:01 schreef duncan smith:

On 08/04/2022 22:08, Antoon Pardon wrote:


Well my first thought is that a bitset makes it less obvious to calulate
the size of the set or to iterate over its elements. But it is an idea
worth exploring.





def popcount(n):
    """
    Returns the number of set bits in n
    """
    cnt = 0
    while n:
    n &= n - 1
    cnt += 1
    return cnt

and not tested,

def iterinds(n):
    """
    Returns a generator of the indices of the set bits of n
    """
    i = 0
    while n:
    if n & 1:
    yield i
    n = n >> 1
    i += 1


Sure but these seem rather naive implementation with a time complexity of
O(n) where n is the maximum number of possible elements. Using these would
turn my O(n) algorithm in a O(n^2) algorithm.

--
Antoon Pardon.
--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-09 Thread Ian Hobson



On 09/04/2022 13:14, Christian Gollwitzer wrote:

Am 08.04.22 um 09:21 schrieb Antoon Pardon:

The first is really hard. Not only may information be missing, no single
single piece of information is unique or immutable. Two people may have
the same name (I know about several other "Peter Holzer"s), a single
person might change their name (when I was younger I went by my middle
name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
same person?), they will move (= change their address), change jobs,
etc. Unless you have a unique immutable identifier that's enforced by
some authority (like a social security number[1]), I don't think there
is a chance to do that reliably in a program (although with enough data,
a heuristic may be good enough).


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually 
containing

them. A range object almost works, with the only problem it is not
comparable with a list.



Then write your own comparator function?

Also, if the only case where this actually works is the index of all 
other records, then a simple boolean flag "all" vs. "these items in the 
index list" would suffice - doesn't it?


 Christian

Writing a comparator function is only possible for a given key. So my 
approach would be:


1) Write a comparator function that takes params X and Y, such that:
if key data is missing from X, return 1
If key data is missing from Y return -1
if X > Y return 1
if X < Y return -1
return 0 # They are equal and key data for both is present

2) Sort the data using the comparator function.

3) Run through the data with a trailing enumeration loop, merging 
matching records together.


4) If there are no records copied out with missing
key data, then you are done, so exit.

5) Choose a new key and repeat from step 1).

Regards

Ian

--
Ian Hobson
Tel (+66) 626 544 695

--
This email has been checked for viruses by AVG.
https://www.avg.com

--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-09 Thread Christian Gollwitzer

Am 08.04.22 um 09:21 schrieb Antoon Pardon:

The first is really hard. Not only may information be missing, no single
single piece of information is unique or immutable. Two people may have
the same name (I know about several other "Peter Holzer"s), a single
person might change their name (when I was younger I went by my middle
name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
same person?), they will move (= change their address), change jobs,
etc. Unless you have a unique immutable identifier that's enforced by
some authority (like a social security number[1]), I don't think there
is a chance to do that reliably in a program (although with enough data,
a heuristic may be good enough).


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually 
containing

them. A range object almost works, with the only problem it is not
comparable with a list.



Then write your own comparator function?

Also, if the only case where this actually works is the index of all 
other records, then a simple boolean flag "all" vs. "these items in the 
index list" would suffice - doesn't it?


Christian

--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-09 Thread duncan smith

On 08/04/2022 22:08, Antoon Pardon wrote:


Op 8/04/2022 om 16:28 schreef duncan smith:

On 08/04/2022 08:21, Antoon Pardon wrote:


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all 
other

records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually 
containing

them. A range object almost works, with the only problem it is not
comparable with a list.



Is there any reason why you can't use ints? Just set the relevant bits.


Well my first thought is that a bitset makes it less obvious to calulate
the size of the set or to iterate over its elements. But it is an idea
worth exploring.





def popcount(n):
"""
Returns the number of set bits in n
"""
cnt = 0
while n:
n &= n - 1
cnt += 1
return cnt

and not tested,

def iterinds(n):
"""
Returns a generator of the indices of the set bits of n
"""
i = 0
while n:
if n & 1:
yield i
n = n >> 1
i += 1


Duncan


--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-08 Thread Antoon Pardon



Op 8/04/2022 om 16:28 schreef duncan smith:

On 08/04/2022 08:21, Antoon Pardon wrote:


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all 
other

records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually 
containing

them. A range object almost works, with the only problem it is not
comparable with a list.



Is there any reason why you can't use ints? Just set the relevant bits.


Well my first thought is that a bitset makes it less obvious to calulate
the size of the set or to iterate over its elements. But it is an idea
worth exploring.

--
Antoon.
--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-08 Thread duncan smith

On 08/04/2022 08:21, Antoon Pardon wrote:



Op 8/04/2022 om 08:24 schreef Peter J. Holzer:

On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote:

Op 7/04/2022 om 16:08 schreef Joel Goldstick:
On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon   
wrote:
I am working with a list of data from which I have to weed out 
duplicates.

At the moment I keep for each entry a container with the other entries
that are still possible duplicates.

[...]
Sorry I wasn't clear. The data contains information about persons. 
But not

all records need to be complete. So a person can occur multiple times in
the list, while the records are all different because they are missing
different bits.

So all records with the same firstname can be duplicates. But if I have
a record in which the firstname is missing, it can at that point be
a duplicate of all other records.

There are two problems. The first one is how do you establish identity.
The second is how do you ween out identical objects. In your first mail
you only asked about the second, but that's easy.

The first is really hard. Not only may information be missing, no single
single piece of information is unique or immutable. Two people may have
the same name (I know about several other "Peter Holzer"s), a single
person might change their name (when I was younger I went by my middle
name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
same person?), they will move (= change their address), change jobs,
etc. Unless you have a unique immutable identifier that's enforced by
some authority (like a social security number[1]), I don't think there
is a chance to do that reliably in a program (although with enough data,
a heuristic may be good enough).


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually 
containing

them. A range object almost works, with the only problem it is not
comparable with a list.



Is there any reason why you can't use ints? Just set the relevant bits.

Duncan
--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-08 Thread Antoon Pardon




Op 8/04/2022 om 08:24 schreef Peter J. Holzer:

On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote:

Op 7/04/2022 om 16:08 schreef Joel Goldstick:

On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon   wrote:

I am working with a list of data from which I have to weed out duplicates.
At the moment I keep for each entry a container with the other entries
that are still possible duplicates.

[...]

Sorry I wasn't clear. The data contains information about persons. But not
all records need to be complete. So a person can occur multiple times in
the list, while the records are all different because they are missing
different bits.

So all records with the same firstname can be duplicates. But if I have
a record in which the firstname is missing, it can at that point be
a duplicate of all other records.

There are two problems. The first one is how do you establish identity.
The second is how do you ween out identical objects. In your first mail
you only asked about the second, but that's easy.

The first is really hard. Not only may information be missing, no single
single piece of information is unique or immutable. Two people may have
the same name (I know about several other "Peter Holzer"s), a single
person might change their name (when I was younger I went by my middle
name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
same person?), they will move (= change their address), change jobs,
etc. Unless you have a unique immutable identifier that's enforced by
some authority (like a social security number[1]), I don't think there
is a chance to do that reliably in a program (although with enough data,
a heuristic may be good enough).


Yes I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.

The problem is, that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is a (immutable) list of all these indexes without actually containing
them. A range object almost works, with the only problem it is not
comparable with a list.

--
Antoon Pardon.

--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-08 Thread Chris Angelico
On Fri, 8 Apr 2022 at 16:26, Peter J. Holzer  wrote:
> Unless you have a unique immutable identifier that's enforced by
> some authority (like a social security number[1]), I don't think there
> is a chance to do that reliably in a program (although with enough data,
> a heuristic may be good enough).
>

Not sure what your footnote was supposed to be, but I'll offer two
useful footnotes to that:

[1] [a] Though they're not actually immutable either, just less
frequently changing

[1[ [b] Of course, that has privacy implications.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-08 Thread Peter J. Holzer
On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote:
> Op 7/04/2022 om 16:08 schreef Joel Goldstick:
> > On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon  wrote:
> > > I am working with a list of data from which I have to weed out duplicates.
> > > At the moment I keep for each entry a container with the other entries
> > > that are still possible duplicates.
[...]
> Sorry I wasn't clear. The data contains information about persons. But not
> all records need to be complete. So a person can occur multiple times in
> the list, while the records are all different because they are missing
> different bits.
> 
> So all records with the same firstname can be duplicates. But if I have
> a record in which the firstname is missing, it can at that point be
> a duplicate of all other records.

There are two problems. The first one is how do you establish identity.
The second is how do you ween out identical objects. In your first mail
you only asked about the second, but that's easy.

The first is really hard. Not only may information be missing, no single
single piece of information is unique or immutable. Two people may have
the same name (I know about several other "Peter Holzer"s), a single
person might change their name (when I was younger I went by my middle
name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
same person?), they will move (= change their address), change jobs,
etc. Unless you have a unique immutable identifier that's enforced by
some authority (like a social security number[1]), I don't think there
is a chance to do that reliably in a program (although with enough data,
a heuristic may be good enough).

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-07 Thread MRAB

On 2022-04-07 16:16, Antoon Pardon wrote:

Op 7/04/2022 om 16:08 schreef Joel Goldstick:

On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon  wrote:

I am working with a list of data from which I have to weed out duplicates.
At the moment I keep for each entry a container with the other entries
that are still possible duplicates.

The problem is sometimes that is all the rest. I thought to use a range
object for these cases. Unfortunatly I sometimes want to sort things
and a range object is not comparable with a list or a tuple.

So I have a list of items where each item is itself a list or range object.
I of course could sort this by using list as a key function but that
would defeat the purpose of using range objects for these cases.

So what would be a relatively easy way to get the same result without wasting
too much memory on entries that haven't any weeding done on them.

--
Antoon Pardon.
--
https://mail.python.org/mailman/listinfo/python-list

I'm not sure I understand what you are trying to do, but if your data
has no order, you can use set to remove the duplicates


Sorry I wasn't clear. The data contains information about persons. But not
all records need to be complete. So a person can occur multiple times in
the list, while the records are all different because they are missing
different bits.

So all records with the same firstname can be duplicates. But if I have
a record in which the firstname is missing, it can at that point be
a duplicate of all other records.


This is how I'd approach it:

# Make a list of groups, where each group is a list of potential duplicates.
# Initially, all of the records are potential duplicates of each other.
records = [list_of_records]

# Split the groups into subgroups according to the first name.
new_records = []

for group in records:
subgroups = defaultdict(list)

for record in group:
subgroups[record['first_name']].append(record)

# Records without a first name could belong to any of the subgroups.
missing = subgroups.pop(None, [])

for record in missing:
for subgroup in subgroups.values():
subgroup.extend(missing)

new_records.extend(subgroups.values())

records = new_records

# Now repeat for the last name, etc.
--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-07 Thread Dieter Maurer
Antoon Pardon wrote at 2022-4-7 17:16 +0200:
> ...
>Sorry I wasn't clear. The data contains information about persons. But not
>all records need to be complete. So a person can occur multiple times in
>the list, while the records are all different because they are missing
>different bits.
>
>So all records with the same firstname can be duplicates. But if I have
>a record in which the firstname is missing, it can at that point be
>a duplicate of all other records.

The description is still not clear enough. Especially, it does
not show where `range` objects come into play.


Answering on a more abstract level:
Apparently, you want to sort a list the elements of which can
be other lists or range objects.

List objects are ordered lexicographically, i.e. for lists l1, l2:
l1 <= l2 iff not l1 or (l2 and l1[0] <= l2[0] and l1[1:] <= l2[1:]).
If you want to sort a list containing list elements are compared using
this order.

For your case, you would need to use a `key` parameter for `sort` that
implements this order for `range` objects, too.
(Note that Python provides a function which transforms an order
definition into an appropriate `key` function).
A corresponding `sort` call may expand your range objects completely.

An alternative might be to not expand `range` objects but to put them
all at the start or end of the sorted list.
Of course, this would imply that their expansion does not influence
their order in the list -- which may or may not be acceptable
(depending on your use case).
If it is acceptable, it is likely possible to not put range objects
into the list to be sorted in the first place.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-07 Thread Antoon Pardon

Op 7/04/2022 om 16:08 schreef Joel Goldstick:

On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon  wrote:

I am working with a list of data from which I have to weed out duplicates.
At the moment I keep for each entry a container with the other entries
that are still possible duplicates.

The problem is sometimes that is all the rest. I thought to use a range
object for these cases. Unfortunatly I sometimes want to sort things
and a range object is not comparable with a list or a tuple.

So I have a list of items where each item is itself a list or range object.
I of course could sort this by using list as a key function but that
would defeat the purpose of using range objects for these cases.

So what would be a relatively easy way to get the same result without wasting
too much memory on entries that haven't any weeding done on them.

--
Antoon Pardon.
--
https://mail.python.org/mailman/listinfo/python-list

I'm not sure I understand what you are trying to do, but if your data
has no order, you can use set to remove the duplicates


Sorry I wasn't clear. The data contains information about persons. But not
all records need to be complete. So a person can occur multiple times in
the list, while the records are all different because they are missing
different bits.

So all records with the same firstname can be duplicates. But if I have
a record in which the firstname is missing, it can at that point be
a duplicate of all other records.

--
Antoon Pardon

--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing sequences with range objects

2022-04-07 Thread Joel Goldstick
On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon  wrote:
>
> I am working with a list of data from which I have to weed out duplicates.
> At the moment I keep for each entry a container with the other entries
> that are still possible duplicates.
>
> The problem is sometimes that is all the rest. I thought to use a range
> object for these cases. Unfortunatly I sometimes want to sort things
> and a range object is not comparable with a list or a tuple.
>
> So I have a list of items where each item is itself a list or range object.
> I of course could sort this by using list as a key function but that
> would defeat the purpose of using range objects for these cases.
>
> So what would be a relatively easy way to get the same result without wasting
> too much memory on entries that haven't any weeding done on them.
>
> --
> Antoon Pardon.
> --
> https://mail.python.org/mailman/listinfo/python-list

I'm not sure I understand what you are trying to do, but if your data
has no order, you can use set to remove the duplicates

-- 
Joel Goldstick
-- 
https://mail.python.org/mailman/listinfo/python-list


Comparing sequences with range objects

2022-04-07 Thread Antoon Pardon

I am working with a list of data from which I have to weed out duplicates.
At the moment I keep for each entry a container with the other entries
that are still possible duplicates.

The problem is sometimes that is all the rest. I thought to use a range
object for these cases. Unfortunatly I sometimes want to sort things
and a range object is not comparable with a list or a tuple.

So I have a list of items where each item is itself a list or range object.
I of course could sort this by using list as a key function but that
would defeat the purpose of using range objects for these cases.

So what would be a relatively easy way to get the same result without wasting
too much memory on entries that haven't any weeding done on them.

--
Antoon Pardon.
--
https://mail.python.org/mailman/listinfo/python-list