[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different
brian.gallagher added the comment:

Just giving this a bump, in case it has been forgotten about.

I've posted a patch at https://github.com/python/cpython/pull/18983. It adds a new parameter "ignorecase" to get_close_matches() that, if set to True, makes the SequenceMatcher compare characters case-insensitively (as determined by str.lower()).

The benefit of using this keyword, as opposed to letting the application handle the normalization, is that it saves memory. If the application has to normalize and supply a separate list to get_close_matches(), it ends up having to maintain a mapping between each original string and its normalized form. As an example:

>>> from difflib import get_close_matches
>>> word = 'apple'
>>> possibilities = ['apPLE', 'APPLE', 'APE', 'Banana', 'Fruit', 'PEAR',
...                  'CoCoNuT']
>>> normalized_possibilities = {p.lower(): p for p in possibilities}
>>> result = get_close_matches(word, normalized_possibilities.keys())
>>> result
['apple', 'ape']
>>> normalized_result = [normalized_possibilities[r] for r in result]
>>> normalized_result
['APPLE', 'APE']

By letting the SequenceMatcher handle the casing on the fly, we could potentially save large amounts of memory if someone was providing a huge list to get_close_matches().

--
Python tracker <https://bugs.python.org/issue39891>
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
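To make the memory argument concrete, here is a minimal sketch of matching "on the fly" without building a second normalized list. It is not the code from the pull request: the helper name close_matches_ignorecase is hypothetical, and it only mirrors the general shape of get_close_matches() (same default n and cutoff) using the public SequenceMatcher API.

```python
from difflib import SequenceMatcher
from heapq import nlargest


def close_matches_ignorecase(word, possibilities, n=3, cutoff=0.6):
    """Hypothetical helper: score lowercased strings, return the originals."""
    matcher = SequenceMatcher()
    matcher.set_seq2(word.lower())
    scored = []
    for candidate in possibilities:
        # Only one temporary lowercased string exists at a time.
        matcher.set_seq1(candidate.lower())
        score = matcher.ratio()
        if score >= cutoff:
            scored.append((score, candidate))
    # Best scores first; compare on the score only.
    best = nlargest(n, scored, key=lambda pair: pair[0])
    return [candidate for _, candidate in best]


possibilities = ['apPLE', 'APPLE', 'APE', 'Banana', 'Fruit', 'PEAR', 'CoCoNuT']
result = close_matches_ignorecase('apple', possibilities)
print(result)
```

Unlike the dict-based approach, both 'apPLE' and 'APPLE' survive here, because nothing is keyed on the lowercased form.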
[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different
brian.gallagher added the comment:

I agree that there is an appeal to leaving any normalization to the application and that trying to guess what people want is a tough hole -- I hadn't even considered what casing would mean in a general sense for Unicode. I'm not entirely convinced that this should be pursued either, but I'll refine my proposal, provide a little context in which I thought it could be a problem, and see what you guys think.

1. Some code is written that assumes get_close_matches() will match on a case-insensitive basis. Only a small bit of testing is done because the functionality is provided by the standard library, not the application code, so we throw a few examples like 'apple' and 'ape' at it and decide it is okay. We later discover we have a bug because we actually need to match against 'AppLE' too.

2. The extension I had in mind was to match on a case-insensitive basis for alphabetic characters only. I don't know much about Unicode, but there are definitely gotchas lurking in my previous statement (titlecase vs. uppercase), so copying the behaviour of str.upper()/str.lower() would seem reasonable to me. The function would still match the same strings it matches today, but would additionally ignore casing. We wouldn't be eliminating any existing matches. I guess this still has the potential to be a breaking change, since someone might indirectly be depending on the current behaviour.

For 1., not testing that your code can handle mixed-case comparisons in the way you're assuming it will is probably your own fault. On the other hand, I think it is reasonable to assume that get_close_matches() will match an uppercase/lowercase counterpart, since the function's intent is to provide intuitive matches that "look right" to a human.

Maybe this is more of a documentation issue than something that needs to be addressed in the code. If a caveat about the case sensitivity of the function is added to the documentation, then a developer can be aware of the limitation and provide any normalization they want in the application code.

Let me know what you guys think.
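As an illustration of the Unicode gotchas mentioned above, and of what application-side normalization could look like, here is a small sketch. str.casefold() is not discussed in this thread; it is brought in here as an assumption, only to show that "ignore case" has more than one reasonable meaning. The helper normalized_matches is hypothetical.

```python
from difflib import get_close_matches

# For plain ASCII letters, lower() and casefold() agree.
assert "AppLE".lower() == "AppLE".casefold() == "apple"

# For some non-ASCII characters they differ, e.g. the German sharp s.
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse


def normalized_matches(word, possibilities, normalize=str.casefold):
    """Hypothetical helper: the application picks the normalization."""
    originals = {}
    for p in possibilities:
        # Keep the first original spelling seen for each normalized form.
        originals.setdefault(normalize(p), p)
    hits = get_close_matches(normalize(word), list(originals))
    return [originals[h] for h in hits]


print(get_close_matches("apple", ["AppLE"]))   # []
print(normalized_matches("apple", ["AppLE"]))  # ['AppLE']
```

Passing the normalization in as a parameter keeps the policy decision (lower(), casefold(), or something locale-aware) with the caller, which is the trade-off the message above is weighing.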
[issue39891] [difflib] Improve get_close_matches() to better match when casing of words are different
New submission from brian.gallagher:

Currently difflib's get_close_matches() doesn't match similar words that differ in their casing very well.

Example:

user@host:~$ python3
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> difflib.get_close_matches("apple", "APPLE")
[]
>>> difflib.get_close_matches("apple", "APpLe")
[]
>>>

These seem like they should be considered close matches for each other, given the SequenceMatcher used in difflib.py attempts to produce a "human-friendly diff" of two words in order to yield "intuitive difference reports".

One solution would be for the user of the function to perform their own transformation of the supplied data, such as converting all strings to lower-case for example. However, it seems like this might be a surprise to a user of the function if they weren't aware of this limitation. It would be preferable to provide this functionality by default in my eyes.

If this is an issue the relevant maintainer(s) consider worth pursuing, I'd love to try my hand at preparing a patch for this.

--
messages: 363618
nosy: brian.gallagher
priority: normal
severity: normal
status: open
title: [difflib] Improve get_close_matches() to better match when casing of words are different
versions: Python 3.6
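A short sketch of why these calls come back empty. Note that the session above passes a bare string as the second argument, which difflib iterates character by character; the sketch below passes lists instead, and the result is the same. The helper name similarity is only for illustration; the default cutoff of 0.6 is the documented default of get_close_matches().

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a, b):
    """Similarity ratio in [0, 1] as computed by SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


# No characters in common once case is taken into account.
print(similarity("apple", "APPLE"))          # 0.0
# Only 'p' and 'e' line up: 2 * 2 / 10.
print(similarity("apple", "APpLe"))          # 0.4
# Both fall below the default cutoff of 0.6.
print(get_close_matches("apple", ["APPLE", "APpLe"]))  # []
# Lowercasing first gives an exact match.
print(similarity("apple", "APPLE".lower()))  # 1.0
```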
[issue39779] [argparse] Add parameter to sort help output arguments
brian.gallagher added the comment:

That makes sense.

For what it's worth, the use case that inspired this was commands with a lot of optional arguments, in a company where a large number of contributors (who may not be aware of an effort to order the arguments in the source code) were able to make changes to the command.

I understand that isn't a particularly compelling reason though, as it can be addressed by other means -- increasing diligence at the code review stage, commit hooks, testing, etc.

Thanks for taking a look, Raymond.

--
Python tracker <https://bugs.python.org/issue39779>
[issue39779] [argparse] Add parameter to sort help output arguments
New submission from brian.gallagher:

import argparse

parser = argparse.ArgumentParser(description='Test')
parser.add_argument('c', help='token c')
parser.add_argument('b', help='token b')
parser.add_argument('d', help='token d')
parser.add_argument('-a', help='token a')
parser.add_argument('-z', help='token z')
parser.add_argument('-f', help='token f', required=True)
parser.print_help()

It would be nice if we could have the option to alphabetically sort the tokens in the optional and positional arguments sections of the help message, in order to find an argument more quickly when reading long help descriptions.

Currently we output the following when the above program is run:

positional arguments:
  c           token c
  b           token b
  d           token d

optional arguments:
  -h, --help  show this help message and exit
  -a A        token a
  -z Z        token z
  -f F        token f

I'm proposing that we provide a mechanism to allow alphabetical ordering of both sections, like so:

positional arguments:
  b           token b
  c           token c
  d           token d

optional arguments:
  -h, --help  show this help message and exit
  -a A        token a
  -f F        token f
  -z Z        token z

I've chosen to leave -h as an exception, as it will always be there as an optional argument, but it could easily be treated no differently.

We could provide an optional argument to print_help(sort=False) as a potential approach.

If this is something that the maintainers would be willing to accept, I'd love to take it on and prepare a patch.

--
components: Library (Lib)
messages: 362849
nosy: brian.gallagher
priority: normal
severity: normal
status: open
title: [argparse] Add parameter to sort help output arguments
type: enhancement
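For comparison, the ordering proposed above can be approximated today in application code. The following is a rough sketch, not the proposed print_help(sort=...) API: SortedHelpFormatter is a hypothetical name, and it relies on overriding HelpFormatter.add_arguments(), which argparse treats as an implementation detail, so it may break between Python versions.

```python
import argparse


class SortedHelpFormatter(argparse.HelpFormatter):
    """Hypothetical formatter that lists each section alphabetically."""

    def add_arguments(self, actions):
        def sort_key(action):
            # Optionals are keyed on their first flag, positionals on dest.
            if action.option_strings:
                name = action.option_strings[0]
            else:
                name = action.dest
            # Keep -h/--help at the top of its section, as in the proposal.
            is_help = "-h" in action.option_strings
            return (not is_help, name.lstrip("-").lower())

        super().add_arguments(sorted(actions, key=sort_key))


parser = argparse.ArgumentParser(
    prog="demo", description="Test", formatter_class=SortedHelpFormatter
)
parser.add_argument('c', help='token c')
parser.add_argument('b', help='token b')
parser.add_argument('d', help='token d')
parser.add_argument('-a', help='token a')
parser.add_argument('-z', help='token z')
parser.add_argument('-f', help='token f', required=True)

help_text = parser.format_help()
print(help_text)
```

Only the per-section argument listing is reordered; the usage line at the top is built separately and keeps the declaration order.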