On Thursday, November 3, 2016 at 3:47:41 PM UTC-7, [email protected] wrote:
> On Thursday, November 3, 2016 at 1:09:48 PM UTC-7, Neil D. Cerutti wrote:
> > you may also be
> > able to use some items "off the shelf" from Python's difflib.
>
> I wasn't aware of that module, thanks for the tip!
>
> difflib.SequenceMatcher.ratio() returns a numerical value which represents
> the "similarity" between two strings. I don't see a precise definition of
> "similar", but it may do what the OP needs.

Following up to myself... I just experimented with
difflib.SequenceMatcher.ratio() and discovered something. The algorithm is not
"commutative" (that is, not symmetric in its arguments): it doesn't ALWAYS
produce the same ratio when the two strings are swapped.
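
A minimal sketch of the effect, using the two short strings 'tide' and 'diet' from the difflib documentation rather than my own data. The docs define ratio() as 2.0*M/T, where M is the number of matched elements and T is the total length of both sequences. The helper name matched_chars is my own, not part of difflib, and I pass None as the junk function here, which is an assumption that differs from the session further down.

```python
from difflib import SequenceMatcher

def matched_chars(a, b):
    """Sum the sizes of the matching blocks SequenceMatcher finds for (a, b)."""
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    # The last block is a zero-size sentinel, so it does not affect the sum.
    return sum(block.size for block in blocks)

a, b = "tide", "diet"
forward = SequenceMatcher(None, a, b).ratio()
backward = SequenceMatcher(None, b, a).ratio()

# Recompute each ratio by hand from the documented formula 2.0 * M / T.
m_forward = matched_chars(a, b)
m_backward = matched_chars(b, a)
total = len(a) + len(b)

print(forward, 2.0 * m_forward / total)
print(backward, 2.0 * m_backward / total)
```

The two orderings find a different number of matched characters, so the ratios differ even though T is the same both ways.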
Here's an excerpt from my interpreter session.
==========
In [1]: from difflib import SequenceMatcher
In [2]: import numpy as np
In [3]: sim = np.zeros((4,4))
== snip ==
In [10]: strings
Out[10]:
('Here is a string.',
'Here is a slightly different string.',
'This string should be significantly different from the other two?',
"Let's look at all these string similarity values in a matrix.")
In [11]: for r, s1 in enumerate(strings):
   ....:     for c, s2 in enumerate(strings):
   ....:         m = SequenceMatcher(lambda x:x=="", s1, s2)
   ....:         sim[r,c] = m.ratio()
   ....:
In [12]: sim
Out[12]:
array([[ 1.        ,  0.64150943,  0.2195122 ,  0.30769231],
       [ 0.64150943,  1.        ,  0.47524752,  0.30927835],
       [ 0.2195122 ,  0.45544554,  1.        ,  0.28571429],
       [ 0.30769231,  0.28865979,  0.33333333,  1.        ]])
==========
The values along the matrix diagonal, of course, are all ones, because each
string was compared to itself.

I also expected the values reflected across the matrix diagonal to match. The
first row does in fact match the first column, but the remaining off-diagonal
pairs disagree somewhat: sim[1,2] is 0.47524752 while sim[2,1] is 0.45544554,
for example. The differences are not large, but they are there. I don't know
why. Caveat programmer.
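
If a symmetric score is needed, one possible workaround is to score both argument orders and combine them. This is only a sketch: symmetric_ratio is a hypothetical helper of my own, not part of difflib, and whether max, min, or an average is the right way to combine the two values depends on what the OP actually needs. The test strings 'tide' and 'diet' come from the difflib documentation.

```python
from difflib import SequenceMatcher

def symmetric_ratio(s1, s2, combine=max):
    """Hypothetical helper: score both argument orders and combine the results."""
    forward = SequenceMatcher(None, s1, s2).ratio()
    backward = SequenceMatcher(None, s2, s1).ratio()
    # combine receives a 2-tuple; max and min both accept that directly.
    return combine((forward, backward))

# Swapping the arguments no longer changes the answer.
print(symmetric_ratio("tide", "diet"))
print(symmetric_ratio("diet", "tide"))
```

This doubles the work per pair, which may matter when filling a large similarity matrix.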