On 2011-11-11 16:53, Jerry Hill wrote:
There's nothing wrong with writing your own code to find the longest common
substring, but are you aware that python has a module in the standard
library that already does this?  In the difflib module, the SequenceMatcher
class can compare two sequences and extract the longest common sequence of
elements from it, like this:

Thanks for the tip. I've played around with it, but I think it doesn't help in the OP's situation. "SequenceMatcher.find_longest_match()" just finds the first common block:

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> first = [0, 1, 2, 3, 0, 4, 5, 6, 0]
>>> second = [1, 2, 3, 4, 5, 6]
>>> match = difflib.SequenceMatcher(None, first, second)
>>> match.find_longest_match(0, len(first), 0, len(second))
Match(a=1, b=0, size=3)

Here it returns just [1, 2, 3] but misses [4, 5, 6]. So you would have to adjust the lower limits to get it. "SequenceMatcher.get_matching_blocks()" seems to be a better choice:

>>> match.get_matching_blocks()
[Match(a=1, b=0, size=3), Match(a=5, b=3, size=3), Match(a=9, b=6, size=0)]

Now you get [1, 2, 3] and [4, 5, 6]. But if the two blocks are in the reversed order, there is no longest common subsequence [1, 2, 3, 4, 5, 6] any more and "SequenceMatcher" only finds one part (apparently it chooses the first it comes across in the first list if both have the same length):

>>> first = [0, 1, 2, 3, 0, 4, 5, 6, 0]
>>> second = [4, 5, 6, 1, 2, 3]
>>> match = difflib.SequenceMatcher(None, first, second)
>>> match.find_longest_match(0, len(first), 0, len(second))
Match(a=1, b=3, size=3)
>>> match.get_matching_blocks()
[Match(a=1, b=3, size=3), Match(a=9, b=6, size=0)]

From both methods you get [1, 2, 3].

As I've learnt during this tests, there is a difference between subsequences and substrings:
http://en.wikipedia.org/wiki/Subsequence#Substring_vs._subsequence

If I've understood the OP right, he/she wants to find all common substrings with a minimum length regardless of their order in the strings.

Bye, Andreas
_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor

Reply via email to