New submission from Giacomo <baldogiacomo...@gmail.com>:

Here I propose a new method, SequenceMatcher.ratio_min(m), which extends difflib's SequenceMatcher.ratio(). Like ratio(), ratio_min(m) returns a measure of the two sequences' similarity as a float in [0, 1]. Unlike ratio(), it ignores matched substrings shorter than a given threshold m, which is the method's only argument besides self. This is useful for avoiding spuriously high similarity scores caused by short matches.

# NEW FUNCTION:

    def ratio_min(self, m):
        """Return a measure of the sequences' similarity (float in [0,1]).

        Where T is the total number of elements in both sequences, and
        M_min is the number of matched elements, counting only matching
        blocks of length at least m, this is 2.0*M_min / T.
        Note that this is 1 if the sequences are identical and at least
        m elements long, and 0 if they have no substring of length m or
        more in common.

        .ratio_min() is similar to .ratio().
        .ratio_min(1) is equivalent to .ratio().

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio_min(1)
        0.75
        >>> s.ratio_min(2)
        0.75
        >>> s.ratio_min(3)
        0.75
        >>> s.ratio_min(4)
        0.0
        """

        # Count matched elements only from blocks of length >= m.
        matches = sum(triple[-1] for triple in self.get_matching_blocks()
                      if triple[-1] >= m)
        return _calculate_ratio(matches, len(self.a) + len(self.b))

----------
components: Library (Lib)
messages: 408622
nosy: gibu
priority: normal
severity: normal
status: open
title: Add ratio_min() function to the difflib library
type: enhancement
versions: Python 3.10, Python 3.11

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue46086>
_______________________________________
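
The proposed behaviour can be tried without patching difflib. The following is a minimal sketch, not the patch itself: ratio_min here is a hypothetical standalone function that takes the matcher as its first argument, and it uses only the public get_matching_blocks() API instead of difflib's private _calculate_ratio() helper.

```python
from difflib import SequenceMatcher


def ratio_min(matcher, m):
    """Hypothetical standalone version of the proposed method.

    Counts matched elements only from blocks of length >= m and
    returns 2.0 * M_min / T, as described in the proposal.
    """
    matches = sum(block.size
                  for block in matcher.get_matching_blocks()
                  if block.size >= m)
    total = len(matcher.a) + len(matcher.b)
    # Same convention as ratio(): two empty sequences score 1.0.
    return 2.0 * matches / total if total else 1.0


# The example from the proposed docstring: one matching block, "bcd".
s = SequenceMatcher(None, "abcd", "bcde")
print(ratio_min(s, 1))  # 0.75, same as s.ratio()
print(ratio_min(s, 3))  # 0.75, the block has length 3
print(ratio_min(s, 4))  # 0.0, no block is that long

# A pair with one long block ("abcde") and one single-character block ("f").
t = SequenceMatcher(None, "abcdeXf", "abcdeYf")
print(ratio_min(t, 1) > ratio_min(t, 2))  # True: the short match is dropped
```

With m = 1 the function reproduces ratio(); raising m removes the contribution of short blocks, so the score can only stay the same or decrease.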