On Thu, Apr 21, 2016 at 2:35 AM Michael Selik <michael.se...@gmail.com> wrote:
> On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano <st...@pearwood.info> > wrote: > >> I want to group [repeated] subsequences. For example, I have: >> "ABCABCABCDEABCDEFABCABCABCB" >> and I want to group it into repeating subsequences. I can see two >> ways... How can I do this? Does this problem have a standard name and/or >> solution? >> > > I'm not aware of a standard name. This sounds like an unsupervised > learning problem. There's no objectively correct answer unless you add more > specificity to the problem statement. > > Regexes may sound tempting at first, but because a repeating subsequence > may have nested repeating subsequences and this can go on infinitely, I > think we at least need a push-down automata. > > I checked out some links for clustering algorithms that work on series > subsequences and I found some fun results. > > Clustering is meaningless! > http://www.cs.ucr.edu/~eamonn/meaningless.pdf > > I think you're in "no free lunch" territory. "Clustering of subsequence > time series remains an open issue in time series clustering" > http://www.hindawi.com/journals/tswj/2014/312521/ > > Any more detail on the problem to add constraints? > Some light reading suggests that you can improve your problem by defining a minimum size for a subsequence to qualify. One paper suggests calling these more interesting repetitions a "motif" to use a music metaphor. Looking for any repetitions results in too many trivial results. Is that valid for your usage? -- https://mail.python.org/mailman/listinfo/python-list