Re: Detecting repeated subsequences of identical items
On Fri, 22 Apr 2016 12:12 am, Chris Angelico wrote: > On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin > wrote: >> In the recursive stack overflow case what you'll usually have is >> >> 1) A few frames leading up to the start of recursion >> 2) A long repetitive sequence of frames >> 3) A few frames at the end showing how the exception was ultimately >> triggered. >> >> You just need to find the cycle that makes that big long sequence. I am not convinced that it is a cycle in the technical sense, or that the two algorithms that Oscar linked to will necessarily find them. They may, by chance, happen to find them sometimes, but the way the algorithms work is by detecting periodic behaviour, and this is not periodic. Look at the comment for Floyd's algorithm here: https://en.wikipedia.org/wiki/Cycle_detection # The hare moves twice as quickly as the tortoise and # the distance between them increases by 1 at each step. # Eventually they will both be inside the cycle and then, ... but that is violated in this scenario. The hare can escape the (non-)cycle before the tortoise even enters it. Even if both enter the "cycle", there's no guarantee that the termination condition `tortoise == hare` will ever be true because the cycle doesn't loop and doesn't go back to the beginning. Hence it is not a cycle. > If the stack got overflowed, there won't usually be a part 3, as part > 2 is the bit that hits sys.recursionlimit (unless increasing the > recursion limit by a finite number would solve the problem). Don't just think of breaking the recursion limit. You might be 100 calls deep in some complex chain of function calls when something raises ValueError. In principle, there might not even be any recursive calls at all. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On Fri, Apr 22, 2016 at 12:30 AM, Oscar Benjamin wrote: > On 21 April 2016 at 15:12, Chris Angelico wrote: >> On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin >> wrote: >>> In the recursive stack overflow case what you'll usually have is >>> >>> 1) A few frames leading up to the start of recursion >>> 2) A long repetitive sequence of frames >>> 3) A few frames at the end showing how the exception was ultimately >>> triggered. >>> >>> You just need to find the cycle that makes that big long sequence. >> >> If the stack got overflowed, there won't usually be a part 3, as part >> 2 is the bit that hits sys.recursionlimit (unless increasing the >> recursion limit by a finite number would solve the problem). For other >> exceptions, yes, this is what you'd see. > > If you have: > > def f(x): > return g(x+1) > > def g(x): > x = h(x) # <-- stack can overflow inside here > return f(x+1) > > # etc. > > So you have a long sequence that goes f, g, f, g but at the end the > stack can overflow while (not recursively) calling h leaving a small > non-cyclic part at the end. Right, good point. Forgot about that. So this situation can be triggered by anywhere up to lines up. Not as helpful an optimization now. Oh well. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On 21 April 2016 at 15:12, Chris Angelico wrote: > On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin > wrote: >> In the recursive stack overflow case what you'll usually have is >> >> 1) A few frames leading up to the start of recursion >> 2) A long repetitive sequence of frames >> 3) A few frames at the end showing how the exception was ultimately >> triggered. >> >> You just need to find the cycle that makes that big long sequence. > > If the stack got overflowed, there won't usually be a part 3, as part > 2 is the bit that hits sys.recursionlimit (unless increasing the > recursion limit by a finite number would solve the problem). For other > exceptions, yes, this is what you'd see. If you have: def f(x): return g(x+1) def g(x): x = h(x) # <-- stack can overflow inside here return f(x+1) # etc. So you have a long sequence that goes f, g, f, g but at the end the stack can overflow while (not recursively) calling h leaving a small non-cyclic part at the end. -- Oscar -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin wrote: > In the recursive stack overflow case what you'll usually have is > > 1) A few frames leading up to the start of recursion > 2) A long repetitive sequence of frames > 3) A few frames at the end showing how the exception was ultimately triggered. > > You just need to find the cycle that makes that big long sequence. If the stack got overflowed, there won't usually be a part 3, as part 2 is the bit that hits sys.recursionlimit (unless increasing the recursion limit by a finite number would solve the problem). For other exceptions, yes, this is what you'd see. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On 21 April 2016 at 13:15, Steven D'Aprano wrote: > On Thu, 21 Apr 2016 06:53 pm, Oscar Benjamin wrote: > >> On 21 April 2016 at 04:07, Steven D'Aprano wrote: >>> I want to group repeated items in a sequence. For example, I can group >>> repeated sequences of a single item at a time using groupby: >>> >>> >>> from itertools import groupby >>> for key, group in groupby("BBCDDEEE"): >>> group = list(group) >>> print(key, "count =", len(group)) >>> >>> >>> outputs: >>> >>> A count = 4 >>> B count = 2 >>> C count = 1 >>> D count = 2 >>> E count = 3 >>> F count = 4 >>> >>> >>> Now I want to group subsequences. For example, I have: >>> >>> "ABCABCABCDEABCDEFABCABCABCB" >>> >>> and I want to group it into repeating subsequences. I can see two ways to >>> group it: >>> >>> ABC ABC ABCDE ABCDE F ABC ABC ABC B >> >> There are some algorithms (helpfully shown in Python) here: >> >> https://en.wikipedia.org/wiki/Cycle_detection > > It's not necessarily a cycle though. Consider a sequence of function calls: > > def f(x): > return g(x) > > def g(x): > if x < 7: > return h(x) > elif x < 50: > return g(x//2) > else: > return x+f(x-1) > > def h(x): > raise ValueError # oops, a bug > > > and a function call f(54). That will result in the chain of calls: > > f(54) -> g(54) -> f(53) -> g(53) -> f(52) -> g(52) -> f(51) -> g(51) -> > f(50) -> g(50) -> g(25) -> g(12) -> g(6) -> h(6) raises > > So you have that almost-cycle f calls g calls f, but it isn't periodic > because you break out of it. I'd still like to detect the repeated f->g > calls. It doesn't matter that you break out of it. It's periodic for some part and the algorithms listed on that page can find the cycle. In the recursive stack overflow case what you'll usually have is 1) A few frames leading up to the start of recursion 2) A long repetitive sequence of frames 3) A few frames at the end showing how the exception was ultimately triggered. You just need to find the cycle that makes that big long sequence. -- Oscar -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On Thu, 21 Apr 2016 06:53 pm, Oscar Benjamin wrote: > On 21 April 2016 at 04:07, Steven D'Aprano wrote: >> I want to group repeated items in a sequence. For example, I can group >> repeated sequences of a single item at a time using groupby: >> >> >> from itertools import groupby >> for key, group in groupby("BBCDDEEE"): >> group = list(group) >> print(key, "count =", len(group)) >> >> >> outputs: >> >> A count = 4 >> B count = 2 >> C count = 1 >> D count = 2 >> E count = 3 >> F count = 4 >> >> >> Now I want to group subsequences. For example, I have: >> >> "ABCABCABCDEABCDEFABCABCABCB" >> >> and I want to group it into repeating subsequences. I can see two ways to >> group it: >> >> ABC ABC ABCDE ABCDE F ABC ABC ABC B > > There are some algorithms (helpfully shown in Python) here: > > https://en.wikipedia.org/wiki/Cycle_detection It's not necessarily a cycle though. Consider a sequence of function calls: def f(x): return g(x) def g(x): if x < 7: return h(x) elif x < 50: return g(x//2) else: return x+f(x-1) def h(x): raise ValueError # oops, a bug and a function call f(54). That will result in the chain of calls: f(54) -> g(54) -> f(53) -> g(53) -> f(52) -> g(52) -> f(51) -> g(51) -> f(50) -> g(50) -> g(25) -> g(12) -> g(6) -> h(6) raises So you have that almost-cycle f calls g calls f, but it isn't periodic because you break out of it. I'd still like to detect the repeated f->g calls. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On Thu, 21 Apr 2016 18:05:40 +1000, Steven D'Aprano wrote: > The specific problem I am trying to solve is that I have a sequence of > strings (in this case, error messages from a Python traceback) and I'm > looking for repeated groups that may indicate mutually recursive calls. E.g. > suppose I have a function f which calls g, and g calls h, and h calls f > again, and there's an exception, you will see a traceback in part: This is a specific case of finding cycles in a directed graph. But treating it as such probably isn't useful here, because you're interested in a specific traversal of that graph rather than the graph itself. One way to approach it is: sofar = [] for line in traceback: if line in sofar: j = sofar.index(line) if sofar[:j] == sofar[j:j*2]: # found repeat sofar = [line] + sofar Note that sofar needs to be in reverse order, because list doesn't have .rindex() or .rfind(). Detecting nested cycles is somewhat harder because given e.g. ababxabababababxababab you'd want the five repeats of ab in the middle to be treated as two repeats of ab followed by three repeats of ab, but there's no way to spot that until later. -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On 21.04.16 06:07, Steven D'Aprano wrote: Now I want to group subsequences. For example, I have: "ABCABCABCDEABCDEFABCABCABCB" and I want to group it into repeating subsequences. [...] How can I do this? Does this problem have a standard name and/or solution? This is a part of lossless data compression algorithms. See for example LZ, LZW. -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On 21 April 2016 at 04:07, Steven D'Aprano wrote: > I want to group repeated items in a sequence. For example, I can group > repeated sequences of a single item at a time using groupby: > > > from itertools import groupby > for key, group in groupby("BBCDDEEE"): > group = list(group) > print(key, "count =", len(group)) > > > outputs: > > A count = 4 > B count = 2 > C count = 1 > D count = 2 > E count = 3 > F count = 4 > > > Now I want to group subsequences. For example, I have: > > "ABCABCABCDEABCDEFABCABCABCB" > > and I want to group it into repeating subsequences. I can see two ways to > group it: > > ABC ABC ABCDE ABCDE F ABC ABC ABC B There are some algorithms (helpfully shown in Python) here: https://en.wikipedia.org/wiki/Cycle_detection Note that those are for a sequence made as x[n+1] = f(x[n]) for some function f. In your case that's just the function that gets the next frame up/down in the call stack. -- Oscar -- Oscar -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On Thursday 21 April 2016 16:35, Michael Selik wrote: > On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano > wrote: > >> I want to group [repeated] subsequences. For example, I have: >> "ABCABCABCDEABCDEFABCABCABCB" >> and I want to group it into repeating subsequences. I can see two >> ways... How can I do this? Does this problem have a standard name and/or >> solution? >> > > I'm not aware of a standard name. This sounds like an unsupervised > learning problem. There's no objectively correct answer unless you add > more specificity to the problem statement. > > Regexes may sound tempting at first, Ah, I may have mislead you all. I cannot use regexes, since the *actual* sequences I'm working on are sequences (lists, tuples, etc) or iterators of arbitrary items. The items themselves may be strings, but the sequence itself is definitely not a string. I just showed a string for convenience. Sorry about that. So no regexes. > but because a repeating subsequence > may have nested repeating subsequences and this can go on infinitely, I > think we at least need a push-down automata. Fortunately, for my *immediate* problem, I would be good with some fairly arbitrary restrictions on subsequence detection. > I checked out some links for clustering algorithms that work on series > subsequences and I found some fun results. > > Clustering is meaningless! > http://www.cs.ucr.edu/~eamonn/meaningless.pdf > > I think you're in "no free lunch" territory. "Clustering of subsequence > time series remains an open issue in time series clustering" > http://www.hindawi.com/journals/tswj/2014/312521/ > > Any more detail on the problem to add constraints? The specific problem I am trying to solve is that I have a sequence of strings (in this case, error messages from a Python traceback) and I'm looking for repeated groups that may indicate mutually recursive calls. E.g. suppose I have a function f which calls g, and g calls h, and h calls f again, and there's an exception, you will see a traceback in part: File "", line 2, in f File "", line 5, in g File "", line 9, in h File "", line 2, in f File "", line 5, in g File "", line 9, in h File "", line 2, in f File "", line 5, in g File "", line 9, in h File "", line 7, in f File "", line 5, in g File "", line 9, in h etc. Note that I only care about lines which are identical, e.g. if the line numbers differ (as in the last call to f), they will be treated as distinct items. So I'd like to group the above as: File "", line 2, in f File "", line 5, in g File "", line 9, in h *** above 3 calls repeated 3 times *** File "", line 7, in f File "", line 5, in g File "", line 9, in h -- Steve -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
Steven D'Aprano writes: > I want to group repeated items in a sequence. For example, I can group > repeated sequences of a single item at a time using groupby: [...] > Now I want to group subsequences. For example, I have: > > "ABCABCABCDEABCDEFABCABCABCB" > > and I want to group it into repeating subsequences. I can see two ways to > group it: > > ABC ABC ABCDE ABCDE F ABC ABC ABC B [...] > or: > > ABC ABC ABC D E A B C D E F ABC ABC ABC B [...] > How can I do this? Does this problem have a standard name and/or solution? Looks like a tough one. I don't think it has a name (I'm not even sure to be able to formally define it). Lets say it is an instance of "longest repeating substring" (https://en.wikipedia.org/wiki/Longest_repeated_substring_problem -- which really does not say much). Anyway, it looks like a job for a suffix trees. Depending on what you are after, you may also be interested in the sequitur algorithm (http://www.sequitur.info/). -- Alain. -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On Thu, Apr 21, 2016 at 2:55 AM Vlastimil Brom wrote: > 2016-04-21 5:07 GMT+02:00 Steven D'Aprano : > > I want to group subsequences. > > "ABCABCABCDEABCDEFABCABCABCB" > > ABC ABC ABCDE ABCDE F ABC ABC ABC B > > or: > > ABC ABC ABC D E A B C D E F ABC ABC ABC B > > if I am not missing something, the latter form of grouping might be > achieved with the following regex: [snip] > The former one seems to be more tricky... > Right. If the problem is constrained to say that repeated subsequences can have no nested repeated subsequences, it's much easier to solve. If you had "ABCABCABCABC" should that result in ABC ABC ABC ABC, with 4 repetitions or ABCABC ABCABC with 2 repetitions? In this example, one might say the higher count is obviously better, but I think it depends on the context. Maybe the user is looking for the biggest patterns rather than the biggest counts. -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
2016-04-21 5:07 GMT+02:00 Steven D'Aprano : > I want to group repeated items in a sequence. For example, I can group > repeated sequences of a single item at a time using groupby: > > > from itertools import groupby > for key, group in groupby("BBCDDEEE"): > group = list(group) > print(key, "count =", len(group)) > > > outputs: > > A count = 4 > B count = 2 > C count = 1 > D count = 2 > E count = 3 > F count = 4 > > > Now I want to group subsequences. For example, I have: > > "ABCABCABCDEABCDEFABCABCABCB" > > and I want to group it into repeating subsequences. I can see two ways to > group it: > > ABC ABC ABCDE ABCDE F ABC ABC ABC B > > giving counts: > > (ABC) count = 2 > (ABCDE) count = 2 > F count = 1 > (ABC) count = 3 > B repeats 1 time > > > or: > > ABC ABC ABC D E A B C D E F ABC ABC ABC B > > giving counts: > > (ABC) count = 3 > D count = 1 > E count = 1 > A count = 1 > B count = 1 > C count = 1 > D count = 1 > E count = 1 > F count = 1 > (ABC) count = 3 > B count = 1 > > > > How can I do this? Does this problem have a standard name and/or solution? > > > > > -- > Steven > > -- > https://mail.python.org/mailman/listinfo/python-list Hi, if I am not missing something, the latter form of grouping might be achieved with the following regex: t="ABCABCABCDEABCDEFABCABCABCB" grouped = re.findall(r"((?:(\w+?)\2+)|\w+?)", t) print(grouped) for grp, subseq in grouped: if subseq: print(subseq, grp.count(subseq)) else: print(grp, "1") the printed output is: [('ABCABCABC', 'ABC'), ('D', ''), ('E', ''), ('A', ''), ('B', ''), ('C', ''), ('D', ''), ('E', ''), ('F', ''), ('ABCABCABC', 'ABC'), ('B', '')] ABC 3 D 1 E 1 A 1 B 1 C 1 D 1 E 1 F 1 ABC 3 B 1 The former one seems to be more tricky... hth, vbr -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On Thu, Apr 21, 2016 at 2:35 AM Michael Selik wrote: > On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano > wrote: > >> I want to group [repeated] subsequences. For example, I have: >> "ABCABCABCDEABCDEFABCABCABCB" >> and I want to group it into repeating subsequences. I can see two >> ways... How can I do this? Does this problem have a standard name and/or >> solution? >> > > I'm not aware of a standard name. This sounds like an unsupervised > learning problem. There's no objectively correct answer unless you add more > specificity to the problem statement. > > Regexes may sound tempting at first, but because a repeating subsequence > may have nested repeating subsequences and this can go on infinitely, I > think we at least need a push-down automata. > > I checked out some links for clustering algorithms that work on series > subsequences and I found some fun results. > > Clustering is meaningless! > http://www.cs.ucr.edu/~eamonn/meaningless.pdf > > I think you're in "no free lunch" territory. "Clustering of subsequence > time series remains an open issue in time series clustering" > http://www.hindawi.com/journals/tswj/2014/312521/ > > Any more detail on the problem to add constraints? > Some light reading suggests that you can improve your problem by defining a minimum size for a subsequence to qualify. One paper suggests calling these more interesting repetitions a "motif" to use a music metaphor. Looking for any repetitions results in too many trivial results. Is that valid for your usage? -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano wrote: > I want to group [repeated] subsequences. For example, I have: > "ABCABCABCDEABCDEFABCABCABCB" > and I want to group it into repeating subsequences. I can see two > ways... How can I do this? Does this problem have a standard name and/or > solution? > I'm not aware of a standard name. This sounds like an unsupervised learning problem. There's no objectively correct answer unless you add more specificity to the problem statement. Regexes may sound tempting at first, but because a repeating subsequence may have nested repeating subsequences and this can go on infinitely, I think we at least need a push-down automata. I checked out some links for clustering algorithms that work on series subsequences and I found some fun results. Clustering is meaningless! http://www.cs.ucr.edu/~eamonn/meaningless.pdf I think you're in "no free lunch" territory. "Clustering of subsequence time series remains an open issue in time series clustering" http://www.hindawi.com/journals/tswj/2014/312521/ Any more detail on the problem to add constraints? -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On Thu, Apr 21, 2016 at 1:07 PM, Steven D'Aprano wrote: > Now I want to group subsequences. For example, I have: > > "ABCABCABCDEABCDEFABCABCABCB" > > and I want to group it into repeating subsequences. I can see two ways to > group it: > > ABC ABC ABCDE ABCDE F ABC ABC ABC B > > or: > > ABC ABC ABC D E A B C D E F ABC ABC ABC B Interesting. I've *almost* managed to (ab)use re.split for this purpose. A one-step solution can be done with re.match: >>> txt = "ABCABCABCDEABCDEFABCABCABCB" >>> re.match(r'(.+)\1+', txt) <_sre.SRE_Match object; span=(0, 9), match='ABCABCABC'> But split then returns only the grouped part: >>> re.split(r'(.+)\1+', txt) ['', 'ABC', 'DEABCDEF', 'ABC', 'B'] or *all* the grouped parts: >>> re.split(r'((.+)\2+)', txt) ['', 'ABCABCABC', 'ABC', 'DEABCDEF', 'ABCABCABC', 'ABC', 'B'] There's definitely a partial solution happening here, but I can't quite make it work. And no, I don't know if there's a standard name for it. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On 04/20/2016 08:57 PM, Ethan Furman wrote: > [snip same pattern as Steven wrote] Nevermind. It's obviously time for me to go to bed. :/ -- ~Ethan~ -- https://mail.python.org/mailman/listinfo/python-list
Re: Detecting repeated subsequences of identical items
On 04/20/2016 08:07 PM, Steven D'Aprano wrote: Now I want to group subsequences. For example, I have: "ABCABCABCDEABCDEFABCABCABCB" and I want to group it into repeating subsequences. I can see two ways to group it: ABC ABC ABCDE ABCDE F ABC ABC ABC B giving counts: (ABC) count = 2 (ABCDE) count = 2 F count = 1 (ABC) count = 3 B repeats 1 time or ABC ABC ABC D E A B C D E F ABC ABC B -- ~Ethan~ -- https://mail.python.org/mailman/listinfo/python-list