John Yeung wrote: > I'm generally pleased with difflib.SequenceMatcher: It's probably not > the best available string matcher out there, but it's in the standard > library and I've seen worse in the wild. One thing that kind of > bothers me is that it's sensitive to which argument you pick as "seq1" > and which you pick as "seq2": > > Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit > (Intel)] on > win32 > Type "help", "copyright", "credits" or "license" for more information. >>>> import difflib >>>> difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio() > 0.44444444444444442 >>>> difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio() > 0.66666666666666663 >>>> > > Is this a bug? I am guessing the algorithm is implemented correctly, > and that it's just an inherent property of the algorithm used. It's > certainly not what I'd call a desirably property. Are there any > simple adjustments that can be made without sacrificing (too much) > performance?
def symmetric_ratio(a, b, S=difflib.SequenceMatcher): return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0 I'm expecting 50% performance loss ;) Seriously, have you tried to calculate the ratio with realistic data? Without looking into the source I would expect the two ratios to get more similar. Peter -- http://mail.python.org/mailman/listinfo/python-list