Re: Comparing sequences with range objects
On 09/04/2022 13:14, Christian Gollwitzer wrote:
> On 08.04.22 at 09:21, Antoon Pardon wrote:
>>> The first is really hard. Not only may information be missing, no
>>> single piece of information is unique or immutable. Two people
>>> may have the same name (I know about several other "Peter
>>> Holzer"s), a single person might change their name (when I was
>>> younger I went by my middle name - how would you know that "Peter
>>> Holzer" and "Hansi Holzer" are the same person?), they will move
>>> (= change their address), change jobs, etc. Unless you have a
>>> unique immutable identifier that's enforced by some authority
>>> (like a social security number[1]), I don't think there is a
>>> chance to do that reliably in a program (although with enough
>>> data, a heuristic may be good enough).
>>
>> Yes I know all that. That is why I keep a bucket of possible
>> duplicates per "identifying" field that is examined and use some
>> heuristics at the end of all the comparing instead of starting to
>> weed out the duplicates at the moment something differs.
>>
>> The problem is that when an identifying field is judged to be
>> unusable, the bucket to be associated with it should conceptually
>> contain all other records (which in this case are the indexes into
>> the population list). But that will eat a lot of memory. So I want
>> some object that behaves as if it is a (immutable) list of all
>> these indexes without actually containing them. A range object
>> almost works, with the only problem it is not comparable with a
>> list.
>
> Then write your own comparator function?
>
> Also, if the only case where this actually works is the index of
> all other records, then a simple boolean flag "all" vs. "these
> items in the index list" would suffice - doesn't it?
>
> Christian

Writing a comparator function is only possible for a given key.
So my approach would be:

1) Write a comparator function that takes params X and Y, such that:

       if key data is missing from X, return 1
       if key data is missing from Y, return -1
       if X > Y, return 1
       if X < Y, return -1
       return 0  # They are equal and key data for both is present

2) Sort the data using the comparator function.

3) Run through the data with a trailing enumeration loop, merging
   matching records together.

4) If there are no records copied out with missing key data, then
   you are done, so exit.

5) Choose a new key and repeat from step 1).

Regards

Ian

--
Ian Hobson
Tel (+66) 626 544 695

--
https://mail.python.org/mailman/listinfo/python-list
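Steps 1) to 3) above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the records are taken to be dicts, the field names and the helpers `make_comparator` and `group_by_key` are made up for the example, and matching records are grouped rather than merged. One deliberate deviation from the rules as written: two records that both lack the key compare as equal, so the ordering stays consistent whichever way round they are compared.

```python
from functools import cmp_to_key


def make_comparator(key):
    """Build a comparator for dict records on one field (hypothetical layout)."""
    def compare(x, y):
        kx, ky = x.get(key), y.get(key)
        # Records without key data sort last. Two such records compare
        # equal, which is an assumption added to keep the order consistent.
        if kx is None and ky is None:
            return 0
        if kx is None:
            return 1
        if ky is None:
            return -1
        if kx > ky:
            return 1
        if kx < ky:
            return -1
        return 0  # equal, and key data is present in both
    return compare


def group_by_key(records, key):
    """Sort on `key`, then walk the sorted list collecting runs of equal keys.

    Returns (groups, leftover); leftover holds the records with no key data.
    """
    ordered = sorted(records, key=cmp_to_key(make_comparator(key)))
    groups, leftover = [], []
    for rec in ordered:
        if rec.get(key) is None:
            leftover.append(rec)
        elif groups and groups[-1][0].get(key) == rec[key]:
            groups[-1].append(rec)  # same key as the previous record
        else:
            groups.append([rec])
    return groups, leftover


people = [
    {"name": "Holzer", "email": "b@example.org"},
    {"name": "Pardon", "email": None},
    {"name": "Holzer", "email": "a@example.org"},
]
by_name, no_name = group_by_key(people, "name")
by_email, no_email = group_by_key(people, "email")
```

Steps 4) and 5) would then amount to calling `group_by_key` again on the leftover records with a different key until nothing is left over.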
Re: Comparing sequences with range objects
On 08.04.22 at 09:21, Antoon Pardon wrote:
>> The first is really hard. Not only may information be missing, no
>> single piece of information is unique or immutable. Two people may
>> have the same name (I know about several other "Peter Holzer"s), a
>> single person might change their name (when I was younger I went
>> by my middle name - how would you know that "Peter Holzer" and
>> "Hansi Holzer" are the same person?), they will move (= change
>> their address), change jobs, etc. Unless you have a unique
>> immutable identifier that's enforced by some authority (like a
>> social security number[1]), I don't think there is a chance to do
>> that reliably in a program (although with enough data, a heuristic
>> may be good enough).
>
> Yes I know all that. That is why I keep a bucket of possible
> duplicates per "identifying" field that is examined and use some
> heuristics at the end of all the comparing instead of starting to
> weed out the duplicates at the moment something differs.
>
> The problem is that when an identifying field is judged to be
> unusable, the bucket to be associated with it should conceptually
> contain all other records (which in this case are the indexes into
> the population list). But that will eat a lot of memory. So I want
> some object that behaves as if it is a (immutable) list of all
> these indexes without actually containing them. A range object
> almost works, with the only problem it is not comparable with a
> list.

Then write your own comparator function?

Also, if the only case where this actually works is the index of all
other records, then a simple boolean flag "all" vs. "these items in
the index list" would suffice - doesn't it?

Christian
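For illustration, the "write your own comparator" suggestion might look like the small helper below. It is a sketch, not anything posted in the thread: the name `same_items` and the sample buckets are assumptions. It shows that a range never compares equal to a list with `==`, while an element-wise helper treats the two alike without ever expanding the range into a list.

```python
from collections.abc import Sequence


def same_items(a: Sequence, b: Sequence) -> bool:
    """Element-wise equality between two sequences, e.g. a range and a list."""
    # Comparing lengths first means a huge range is rejected without iteration.
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


everyone = range(1_000_000)  # stands in for "all records", stores no elements
bucket = [0, 1, 2]           # an ordinary bucket of record indexes

print(range(3) == bucket)             # False: a range never equals a list
print(same_items(range(3), bucket))   # True
print(same_items(everyone, bucket))   # False, decided by the length check
```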
Re: Comparing sequences with range objects
On 08/04/2022 22:08, Antoon Pardon wrote:
> On 8/04/2022 at 16:28, duncan smith wrote:
>> On 08/04/2022 08:21, Antoon Pardon wrote:
>>> Yes I know all that. That is why I keep a bucket of possible
>>> duplicates per "identifying" field that is examined and use some
>>> heuristics at the end of all the comparing instead of starting to
>>> weed out the duplicates at the moment something differs.
>>>
>>> The problem is that when an identifying field is judged to be
>>> unusable, the bucket to be associated with it should conceptually
>>> contain all other records (which in this case are the indexes
>>> into the population list). But that will eat a lot of memory. So
>>> I want some object that behaves as if it is a (immutable) list of
>>> all these indexes without actually containing them. A range
>>> object almost works, with the only problem it is not comparable
>>> with a list.
>>
>> Is there any reason why you can't use ints? Just set the relevant
>> bits.
>
> Well my first thought is that a bitset makes it less obvious to
> calculate the size of the set or to iterate over its elements. But
> it is an idea worth exploring.

def popcount(n):
    """
    Returns the number of set bits in n
    """
    cnt = 0
    while n:
        n &= n - 1
        cnt += 1
    return cnt

and not tested,

def iterinds(n):
    """
    Returns a generator of the indices of the set bits of n
    """
    i = 0
    while n:
        if n & 1:
            yield i
        n = n >> 1
        i += 1

Duncan
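A short sketch of how buckets could be held as ints along these lines. The helper names `bitset_from`, `all_records` and `members` are invented for the example, and all indexes are assumed to be non-negative. The "all records" bucket becomes a single int with the low n bits set, and combining two buckets is one `&` operation.

```python
def bitset_from(indices):
    """Pack an iterable of non-negative indexes into one int."""
    bits = 0
    for i in indices:
        bits |= 1 << i
    return bits


def all_records(n):
    """Bitset with bits 0..n-1 set, i.e. every record index below n."""
    return (1 << n) - 1


def members(bits):
    """Yield the indexes of the set bits, lowest first."""
    i = 0
    while bits:
        if bits & 1:
            yield i
        bits >>= 1
        i += 1


a = bitset_from([1, 4, 6])
b = bitset_from([4, 6, 9])
common = a & b                            # intersection of the two buckets

print(list(members(common)))              # [4, 6]
print(bin(common).count("1"))             # 2, the size of the intersection
print(list(members(all_records(5) & a)))  # [1, 4]
```

The "all records" int still occupies memory on the order of one bit per record, so it is much smaller than a list of indexes but not free. On Python 3.10 and later, `int.bit_count()` gives the size directly.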
Re: Issues
On 4/8/22 14:24, MRAB wrote:
> On 2022-04-08 20:35, Stevenson, John B via Python-list wrote:
>> Hello,
>>
>> As a quick disclaimer, I am sorry if you have received this message
>> multiple times over from me. I've been having technical difficulties
>> trying to reach this email. Thank you.
>>
>> I'm trying to install Python on a computer so that I can use it for
>> various tasks for my job, like mapping and programming. But it's not
>> downloading the necessary files into the right repository for me to
>> run Python commands in a command prompt. I can open the Python app
>> just fine, but I cannot use it in the terminal, and this messes with
>> pip and prevents me from doing my task. What can I do to fix this?
>> Error sent back is "'python' is not recognized as an internal or
>> external command, operable program or batch file." Thank you.
>>
> Try the Python Launcher instead by typing "py" instead of "python".

And read this:

https://docs.python.org/3/using/windows.html#launcher

(other parts of the page will probably be useful to you as well)
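If the Python app itself opens, one way to find out where the interpreter lives, and to build a pip command that does not depend on "python" being on PATH, might be the sketch below. The helper name `pip_command` is made up for the example. The snippet only prints the command and does not run it.

```python
import sys


def pip_command(*args):
    """Command line that runs pip with the interpreter executing this script."""
    return [sys.executable, "-m", "pip", *args]


# Full path of the interpreter that is running right now.
print(sys.executable)

# The same pip invocation, spelled out for pasting into a terminal.
# A path containing spaces needs quoting when typed by hand.
print(" ".join(pip_command("--version")))
```

With the launcher, the equivalent typed directly at the command prompt would be `py -m pip --version`.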