[lang] Longest common substring / Suffix Tree

Thomas Neidhart Mon, 12 Mar 2012 13:20:39 -0700

Hi,

on the weekend, I started to work on issue LANG-680
(https://issues.apache.org/jira/browse/LANG-680), which is about adding
support for finding the longest common substring of a set of Strings.


Suffix Trees are a standard data structure to efficiently solve this
problem, and I created a variant of Ukkonen's algorithm to construct
such a tree in linear time.

After some research to create a generalized version of it (to support
multiple strings), I found a very nice implementation from Alessandro
Bahgat (https://github.com/abahgat/suffixtree), which is very similar to
my own, and contacted the author if he is interested to put his code
under apache license (which he did already).

So the question basically is how to proceed, as adding a data structure
like a Suffix Tree to commons-lang may be out-of-scope. On the other
hand there are several string related problems that can be efficiently
solved with a Suffix Tree:

 * longest common substring
 * longest repeated substring
 * exact string set matching
 * palindrom finding
 * finding frequent substrings

Implementing a longest common substring for two strings is trivial
(using dynamic programming ~20 lines of code), but for a set of strings
a suffix tree would be needed.

Henri already outlined a possible API as comment to the issue, which I
like, so what do other people think about adding a suffix tree to
commons-lang?

Thomas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

[lang] Longest common substring / Suffix Tree

Reply via email to