Re: Ensuring symmetry in difflib.SequenceMatcher

2010-11-24 Thread John Machin
On Nov 24, 8:43 pm, Peter Otten <__pete...@web.de> wrote:
> John Yeung wrote:
> > I'm generally pleased with difflib.SequenceMatcher:  It's probably not
> > the best available string matcher out there, but it's in the standard
> > library and I've seen worse in the wild.  One thing that kind of
> > bothers me is that it's sensitive to which argument you pick as "seq1"
> > and which you pick as "seq2":
>
> > Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
> > (Intel)] on
> > win32
> > Type "help", "copyright", "credits" or "license" for more information.
>  import difflib
>  difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()
> > 0.2
>  difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()
> > 0.3
>
> > Is this a bug?  I am guessing the algorithm is implemented correctly,
> > and that it's just an inherent property of the algorithm used.  It's
> > certainly not what I'd call a desirably property.  Are there any
> > simple adjustments that can be made without sacrificing (too much)
> > performance?
>
> def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
>     return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0
>
> I'm expecting 50% performance loss ;)
>
> Seriously, have you tried to calculate the ratio with realistic data?
> Without looking into the source I would expect the two ratios to get more
> similar.
>
> Peter

Surnames are extremely realistic data. The OP should consider using
Levenshtein distance, which is "symmetric". A good (non-naive)
implementation should be much faster than difflib.

ratio = 1.0 - levenshtein(a, b) / float(max(len(a), len(b)))
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ensuring symmetry in difflib.SequenceMatcher

2010-11-24 Thread Peter Otten
John Yeung wrote:

> I'm generally pleased with difflib.SequenceMatcher:  It's probably not
> the best available string matcher out there, but it's in the standard
> library and I've seen worse in the wild.  One thing that kind of
> bothers me is that it's sensitive to which argument you pick as "seq1"
> and which you pick as "seq2":
> 
> Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
> (Intel)] on
> win32
> Type "help", "copyright", "credits" or "license" for more information.
 import difflib
 difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()
> 0.2
 difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()
> 0.3

> 
> Is this a bug?  I am guessing the algorithm is implemented correctly,
> and that it's just an inherent property of the algorithm used.  It's
> certainly not what I'd call a desirably property.  Are there any
> simple adjustments that can be made without sacrificing (too much)
> performance?

def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0

I'm expecting 50% performance loss ;)

Seriously, have you tried to calculate the ratio with realistic data? 
Without looking into the source I would expect the two ratios to get more 
similar.

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ensuring symmetry in difflib.SequenceMatcher

2010-11-24 Thread Wolfgang Rohdewald
On Mittwoch 24 November 2010, John Yeung wrote:
>  Are there any
> simple adjustments that can be made without sacrificing (too
> much) performance?

>>> difflib.SequenceMatcher(None,*sorted(('BYRD','BRADY'))).ratio()
0.3

-- 
Wolfgang
-- 
http://mail.python.org/mailman/listinfo/python-list


Ensuring symmetry in difflib.SequenceMatcher

2010-11-23 Thread John Yeung
I'm generally pleased with difflib.SequenceMatcher:  It's probably not
the best available string matcher out there, but it's in the standard
library and I've seen worse in the wild.  One thing that kind of
bothers me is that it's sensitive to which argument you pick as "seq1"
and which you pick as "seq2":

Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()
0.2
>>> difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()
0.3
>>>

Is this a bug?  I am guessing the algorithm is implemented correctly,
and that it's just an inherent property of the algorithm used.  It's
certainly not what I'd call a desirably property.  Are there any
simple adjustments that can be made without sacrificing (too much)
performance?

John
-- 
http://mail.python.org/mailman/listinfo/python-list