[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different

brian.gallagher Wed, 08 Apr 2020 15:45:21 -0700


brian.gallagher <oss.brn...@gmail.com> added the comment:


Just giving this a bump, in case it has been forgotten about.

I've posted a patch at https://github.com/python/cpython/pull/18983.

It adds a new parameter, "ignorecase", to get_close_matches(). When set to
True, it makes the SequenceMatcher compare characters case-insensitively
(as determined by str.lower()).

The benefit of using this keyword, as opposed to letting the application
handle the normalization, is that it saves memory. If the application has to
normalize the strings and supply a separate list to get_close_matches(), it
ends up having to maintain a mapping between each original string and its
normalized form. As an example:

>>> from difflib import get_close_matches
>>> word = 'apple'
>>> possibilities = ['apPLE', 'APPLE', 'APE', 'Banana', 'Fruit', 'PEAR',
...                  'CoCoNuT']
>>> normalized_possibilities = {p.lower(): p for p in possibilities}
>>> result = get_close_matches(word, normalized_possibilities.keys())
>>> result
['apple', 'ape']
>>> normalized_result = [normalized_possibilities[r] for r in result]
>>> normalized_result
['APPLE', 'APE']

By letting the SequenceMatcher handle the casing on the fly, we could
potentially save large amounts of memory if someone was providing a huge list
to get_close_matches().
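
To illustrate the "on the fly" idea, here is a rough sketch of a standalone
helper. It is not the code from the pull request: the name
close_matches_ignorecase is hypothetical, and the body only mirrors the
general shape of get_close_matches() using the public SequenceMatcher API.
It lowercases one candidate at a time, so no normalized copy of the whole
list and no lowercase-to-original mapping is kept.

```python
from difflib import SequenceMatcher
from heapq import nlargest


def close_matches_ignorecase(word, possibilities, n=3, cutoff=0.6):
    """Hypothetical helper: case-insensitive variant of get_close_matches().

    Returns up to n of the original (un-normalized) strings, best first.
    """
    matcher = SequenceMatcher()
    matcher.set_seq2(word.lower())
    scored = []
    for candidate in possibilities:
        # Only a temporary lowercase copy of this one candidate exists here.
        matcher.set_seq1(candidate.lower())
        # The cheap upper bounds are checked first, then the full ratio.
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff
                and matcher.ratio() >= cutoff):
            scored.append((matcher.ratio(), candidate))
    # Keep the n best-scoring candidates, in their original casing.
    return [candidate for _, candidate in nlargest(n, scored)]


possibilities = ['apPLE', 'APPLE', 'APE', 'Banana', 'Fruit', 'PEAR',
                 'CoCoNuT']
print(close_matches_ignorecase('apple', possibilities))
```

With the list from the example above this prints ['apPLE', 'APPLE', 'APE'].
Note that both 'apPLE' and 'APPLE' are returned, whereas the dict-based
mapping keeps only one original per lowercase key.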

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39891>
_______________________________________