[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

2020-06-23 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

I concur with the other respondents that this is best left to the application 
code.


Thank you for the suggestion, but I'll mark this as closed.  Don't be deterred 
from making other suggestions :-)

--
resolution:  -> rejected
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

2020-06-21 Thread Rémi Lapeyre

Rémi Lapeyre  added the comment:

I feel like it's a bit weird to have a new function just for ignoring case. 
Will a new function be required for every possible normalization, like removing 
accents? One possible way to make the API handle those use cases would be to 
have a keyword argument for this:


>>> difflib.get_close_matches('apple', ['APPLE'], normalization=str.lower)
['APPLE']

Then it could work with other normalization too without requiring a new 
function every time:

>>> difflib.get_close_matches('Remi', ['Rémi'], normalization=remove_accents)
['Rémi']
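
A minimal sketch of how such a keyword could be emulated today with a wrapper. Both get_close_matches_normalized() and remove_accents() are hypothetical names used for illustration; difflib itself has no normalization argument. The wrapper maps each normalized form back to every original string that produced it, so the caller still gets the original spellings:

```python
import difflib
import unicodedata


def remove_accents(text):
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def get_close_matches_normalized(word, possibilities, n=3, cutoff=0.6,
                                 normalization=None):
    """Hypothetical wrapper; difflib has no `normalization` keyword."""
    if normalization is None:
        return difflib.get_close_matches(word, possibilities, n, cutoff)
    # Map each normalized form back to the original strings that produced it.
    originals = {}
    for candidate in possibilities:
        originals.setdefault(normalization(candidate), []).append(candidate)
    hits = difflib.get_close_matches(normalization(word), list(originals),
                                     n, cutoff)
    # Expand the normalized hits back to the original strings, best first.
    return [orig for hit in hits for orig in originals[hit]][:n]


print(get_close_matches_normalized("apple", ["APPLE"],
                                   normalization=str.lower))
print(get_close_matches_normalized("Remi", ["Rémi"],
                                   normalization=remove_accents))
```

Keeping a list per normalized key means that two inputs differing only in case are both returned rather than one silently replacing the other.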

--
components: +Library (Lib)
nosy: +remi.lapeyre
versions: +Python 3.10 -Python 3.9




[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

2020-04-08 Thread brian.gallagher


brian.gallagher  added the comment:

Just giving this a bump, in case it has been forgotten about.

I've posted a patch at https://github.com/python/cpython/pull/18983.

It adds a new parameter "ignorecase" to get_close_matches() that, if set to 
True, will result in the SequenceMatcher treating any character case 
insensitively (as determined by str.lower()).

The benefit to using this keyword, as opposed to letting the application handle 
the normalization, is that it saves on memory. If the application has to 
normalize and supply a separate list to get_close_matches(), then it ends up 
having to maintain a mapping between the original string and the normalized 
string. As an example:

>>> from difflib import get_close_matches
>>> word = 'apple'
>>> possibilities = ['apPLE', 'APPLE', 'APE', 'Banana', 'Fruit', 'PEAR',
...                  'CoCoNuT']
>>> normalized_possibilities = {p.lower(): p for p in possibilities}
>>> result = get_close_matches(word, normalized_possibilities.keys())
>>> result
['apple', 'ape']
>>> normalized_result = [normalized_possibilities[r] for r in result]
>>> normalized_result
['APPLE', 'APE']

By letting the SequenceMatcher handle the casing on the fly, we could 
potentially save large amounts of memory if someone was providing a huge list 
to get_close_matches.
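
To illustrate the idea of handling case while scoring, here is a rough sketch that lowercases one candidate at a time and never builds a mapping. It is not the code from the pull request; close_matches_ignorecase() is a hypothetical name, and the loop simply mirrors the cutoff checks that get_close_matches() performs with SequenceMatcher. Each candidate still gets a short-lived lowercase copy, so the saving is the retained dictionary, not every allocation:

```python
from difflib import SequenceMatcher
from heapq import nlargest


def close_matches_ignorecase(word, possibilities, n=3, cutoff=0.6):
    """Hypothetical helper, not the implementation from the pull request."""
    matcher = SequenceMatcher()
    matcher.set_seq2(word.lower())
    scored = []
    for candidate in possibilities:
        # Lowercase one candidate at a time; the temporary copy can be
        # discarded as soon as the candidate has been scored.
        matcher.set_seq1(candidate.lower())
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff
                and matcher.ratio() >= cutoff):
            scored.append((matcher.ratio(), candidate))
    # Keep the n best scores and return the original, un-normalized strings.
    return [candidate for _, candidate in nlargest(n, scored)]


possibilities = ['apPLE', 'APPLE', 'APE', 'Banana', 'Fruit', 'PEAR',
                 'CoCoNuT']
print(close_matches_ignorecase('apple', possibilities))
```

Unlike the dictionary approach, both 'apPLE' and 'APPLE' survive here, because the original strings are never collapsed onto one normalized key.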

--




[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

2020-03-13 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
nosy: +rhettinger




[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

2020-03-13 Thread Roundup Robot


Change by Roundup Robot :


--
keywords: +patch
nosy: +python-dev
nosy_count: 3.0 -> 4.0
pull_requests: +18331
stage: needs patch -> patch review
pull_request: https://github.com/python/cpython/pull/18983




[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

2020-03-08 Thread brian.gallagher


brian.gallagher  added the comment:

I agree that there is an appeal to leaving any normalization to the application 
and that trying to guess what people want is a deep hole -- I hadn't even 
considered what casing would mean in a general sense for Unicode.

I'm not entirely convinced that this should be pursued either, but I'll refine 
my proposal, provide a little context in which I thought it could be a problem 
and see what you guys think.

1. Some code is written that assumes get_close_matches() will match on a 
case-insensitive basis. Only a small bit of testing is done because the 
functionality is provided by the standard library not the application code, so 
we throw a few examples like 'apple' and 'ape' and decide it is okay. We later 
on discover we have a bug because we actually need to match against 'AppLE' too.

2. The extension I had in mind was to match on a case-insensitive basis for 
only the alphabetic characters. I don't know much about Unicode, but there are 
definitely gotchas lurking in my previous statement (titlecase vs. uppercase), 
so copying the behaviour of str.upper()/str.lower() would seem reasonable to 
me. The function would still match the same strings it does today, but would 
now ignore casing as well. We wouldn't be eliminating any existing matches. I 
guess this still has the potential to be a breaking change, since someone 
might indirectly be depending on the current case-sensitive behaviour.

For 1., not testing that your code can handle mixed case comparisons in the way 
you're assuming it will is probably your own fault. On the other hand, I think 
it is a reasonable assumption to think that get_close_matches() will match an 
uppercase/lowercase counterpart since the function's intent is to provide 
intuitive matches that "look right" to a human. 

Maybe this is more of a documentation issue than something that needs to be 
addressed in the code. If a caveat about the case sensitivity of the function 
is added to the documentation, then a developer can be aware of the limitation 
in order to provide any normalization they want in the application code.

Let me know what you guys think.

--




[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

2020-03-08 Thread Marc-Andre Lemburg


Marc-Andre Lemburg  added the comment:

It looks like Brian is expecting some kind of normalization of the strings 
before they enter the function, e.g. convert to lowercase, remove extra 
whitespace, convert diacritics to regular letters, combinations of such 
normalizations, etc.

Since both "word" and "possibilities" would have to be normalized, I think it's 
better to let the application deal with this efficiently than try to come up 
with a new function or add a normalize keyword function parameter.
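
As a sketch of what application-side normalization might look like, the snippet below chains the kinds of steps listed above (lowercasing, collapsing whitespace, stripping diacritics) and applies the same function to both the word and the possibilities. The helper names and the sample strings are made up for illustration, and the order of the steps is just one possible choice:

```python
import unicodedata
from difflib import get_close_matches


def strip_diacritics(text):
    """Drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collapse_whitespace(text):
    """Replace runs of whitespace with one space and trim both ends."""
    return " ".join(text.split())


def normalize(text):
    """Apply the application's chosen normalizations in a fixed order."""
    return collapse_whitespace(strip_diacritics(text.lower()))


word = "creme  brulee"
possibilities = ["Crème Brûlée", "CREAM SODA", "Brûlot"]
normalized = [normalize(p) for p in possibilities]
print(get_close_matches(normalize(word), normalized))
```

Because the application owns normalize(), it can add or drop steps without any change to difflib.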

--
nosy: +lemburg




[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

2020-03-07 Thread Tim Peters


Tim Peters  added the comment:

If you pursue this, please introduce a new function for it.  You immediately 
have an idea about how to change the current function precisely because it 
_doesn't_ try to guess what you really wanted.  That lack of magic is valuable 
- you're not actually confused by what it does, because it doesn't do anything 
to the strings you give it ;-)  By the same token, if you have a crisp idea of 
how it should treat strings instead, it's straightforward to write a wrapper 
that does so.  The existing function won't fight you by trying to impose its 
own ideas.

Guessing what people really wanted tends to become a bottomless pit.  For 
example, do you know all the rules for what "case" even means in a Unicode 
world?  What about diacritical marks?  And so on.  I don't.
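
Two small cases show how quickly this gets subtle. The snippet uses only built-in str methods and the standard unicodedata module; the sample strings are chosen purely for illustration:

```python
import unicodedata

# German sharp s: lower() leaves it alone, casefold() expands it to "ss".
lower_equal = "Straße".lower() == "STRASSE".lower()
casefold_equal = "Straße".casefold() == "STRASSE".casefold()
print(lower_equal, casefold_equal)

# The same visible text can be one code point or two.
composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent
raw_equal = composed == decomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
print(raw_equal, nfc_equal)
```

So even "ignore case" has at least two reasonable meanings (lower() versus casefold()), and diacritics raise a separate question about Unicode normalization forms.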

Not saying it shouldn't be pursued.  Am saying it may be hard to reach 
consensus on what's "really" wanted.  By some people, some of the time.  Never 
- alas - by all people all the time.

--
nosy: +tim.peters
stage:  -> needs patch
type:  -> enhancement
versions: +Python 3.9 -Python 3.6




[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

2020-03-07 Thread brian.gallagher


New submission from brian.gallagher :

Currently difflib's get_close_matches() doesn't match similar words that differ 
in their casing very well.

Example:
user@host:~$ python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> difflib.get_close_matches("apple", ["APPLE"])
[]
>>> difflib.get_close_matches("apple", ["APpLe"])
[]
>>>

These seem like they should be considered close matches for each other, given 
the SequenceMatcher used in difflib.py attempts to produce a "human-friendly 
diff" of two words in order to yield "intuitive difference reports".
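
For context on why nothing is returned, the similarity scores can be inspected directly. This small sketch compares each candidate against "apple" with SequenceMatcher.ratio() and checks it against 0.6, the default cutoff of get_close_matches(); the similarity() helper is a hypothetical name used only here:

```python
from difflib import SequenceMatcher

DEFAULT_CUTOFF = 0.6  # default value of the cutoff argument


def similarity(candidate, word="apple"):
    """Score a candidate the way get_close_matches() does (candidate first)."""
    return SequenceMatcher(None, candidate, word).ratio()


for candidate in ["APPLE", "APpLe", "apple"]:
    score = similarity(candidate)
    print(candidate, score, score >= DEFAULT_CUTOFF)
```

Upper- and lowercase letters are simply different characters to the matcher, so "APPLE" shares nothing with "apple" and scores 0.0, and only the exact-case string clears the cutoff.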

One solution would be for the user of the function to perform their own 
transformation of the supplied data, such as converting all strings to 
lower-case for example. However, it seems like this might be a surprise to a 
user of the function if they weren't aware of this limitation. It would be 
preferable to provide this functionality by default in my eyes.

If this is an issue the relevant maintainer(s) consider worth pursuing, I'd 
love to try my hand at preparing a patch for this.

--
messages: 363618
nosy: brian.gallagher
priority: normal
severity: normal
status: open
title: [difflib] Improve get_close_matches() to better match when casing of 
words are different
versions: Python 3.6
