Re: Detecting repeated subsequences of identical items

2016-04-22 Thread Steven D'Aprano
On Fri, 22 Apr 2016 12:12 am, Chris Angelico wrote:

> On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin
>  wrote:
>> In the recursive stack overflow case what you'll usually have is
>>
>> 1) A few frames leading up to the start of recursion
>> 2) A long repetitive sequence of frames
>> 3) A few frames at the end showing how the exception was ultimately
>> triggered.
>>
>> You just need to find the cycle that makes that big long sequence.

I am not convinced that it is a cycle in the technical sense, or that the
two algorithms that Oscar linked to will necessarily find them. They may,
by chance, happen to find them sometimes, but the way the algorithms work
is by detecting periodic behaviour, and this is not periodic.

Look at the comment for Floyd's algorithm here:

https://en.wikipedia.org/wiki/Cycle_detection

# The hare moves twice as quickly as the tortoise and
# the distance between them increases by 1 at each step.
# Eventually they will both be inside the cycle and then, ...


but that is violated in this scenario. The hare can escape the (non-)cycle
before the tortoise even enters it. Even if both enter the "cycle", there's
no guarantee that the termination condition `tortoise == hare` will ever be
true because the cycle doesn't loop and doesn't go back to the beginning.
Hence it is not a cycle.

 
> If the stack got overflowed, there won't usually be a part 3, as part 
> 2 is the bit that hits sys.recursionlimit (unless increasing the
> recursion limit by a finite number would solve the problem). 

Don't just think of breaking the recursion limit. You might be 100 calls
deep in some complex chain of function calls when something raises
ValueError. In principle, there might not even be any recursive calls at
all.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Chris Angelico
On Fri, Apr 22, 2016 at 12:30 AM, Oscar Benjamin
 wrote:
> On 21 April 2016 at 15:12, Chris Angelico  wrote:
>> On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin
>>  wrote:
>>> In the recursive stack overflow case what you'll usually have is
>>>
>>> 1) A few frames leading up to the start of recursion
>>> 2) A long repetitive sequence of frames
>>> 3) A few frames at the end showing how the exception was ultimately 
>>> triggered.
>>>
>>> You just need to find the cycle that makes that big long sequence.
>>
>> If the stack got overflowed, there won't usually be a part 3, as part
>> 2 is the bit that hits sys.recursionlimit (unless increasing the
>> recursion limit by a finite number would solve the problem). For other
>> exceptions, yes, this is what you'd see.
>
> If you have:
>
> def f(x):
> return g(x+1)
>
> def g(x):
> x = h(x)  # <-- stack can overflow inside here
> return f(x+1)
>
> # etc.
>
> So you have a long sequence that goes f, g, f, g but at the end the
> stack can overflow while (not recursively) calling h leaving a small
> non-cyclic part at the end.

Right, good point. Forgot about that. So this situation can be
triggered by anywhere up to  lines up. Not as helpful an
optimization now. Oh well.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Oscar Benjamin
On 21 April 2016 at 15:12, Chris Angelico  wrote:
> On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin
>  wrote:
>> In the recursive stack overflow case what you'll usually have is
>>
>> 1) A few frames leading up to the start of recursion
>> 2) A long repetitive sequence of frames
>> 3) A few frames at the end showing how the exception was ultimately 
>> triggered.
>>
>> You just need to find the cycle that makes that big long sequence.
>
> If the stack got overflowed, there won't usually be a part 3, as part
> 2 is the bit that hits sys.recursionlimit (unless increasing the
> recursion limit by a finite number would solve the problem). For other
> exceptions, yes, this is what you'd see.

If you have:

def f(x):
return g(x+1)

def g(x):
x = h(x)  # <-- stack can overflow inside here
return f(x+1)

# etc.

So you have a long sequence that goes f, g, f, g but at the end the
stack can overflow while (not recursively) calling h leaving a small
non-cyclic part at the end.

--
Oscar
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Chris Angelico
On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin
 wrote:
> In the recursive stack overflow case what you'll usually have is
>
> 1) A few frames leading up to the start of recursion
> 2) A long repetitive sequence of frames
> 3) A few frames at the end showing how the exception was ultimately triggered.
>
> You just need to find the cycle that makes that big long sequence.

If the stack got overflowed, there won't usually be a part 3, as part
2 is the bit that hits sys.recursionlimit (unless increasing the
recursion limit by a finite number would solve the problem). For other
exceptions, yes, this is what you'd see.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Oscar Benjamin
On 21 April 2016 at 13:15, Steven D'Aprano  wrote:
> On Thu, 21 Apr 2016 06:53 pm, Oscar Benjamin wrote:
>
>> On 21 April 2016 at 04:07, Steven D'Aprano  wrote:
>>> I want to group repeated items in a sequence. For example, I can group
>>> repeated sequences of a single item at a time using groupby:
>>>
>>>
>>> from itertools import groupby
>>> for key, group in groupby("BBCDDEEE"):
>>> group = list(group)
>>> print(key, "count =", len(group))
>>>
>>>
>>> outputs:
>>>
>>> A count = 4
>>> B count = 2
>>> C count = 1
>>> D count = 2
>>> E count = 3
>>> F count = 4
>>>
>>>
>>> Now I want to group subsequences. For example, I have:
>>>
>>> "ABCABCABCDEABCDEFABCABCABCB"
>>>
>>> and I want to group it into repeating subsequences. I can see two ways to
>>> group it:
>>>
>>> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>>
>> There are some algorithms (helpfully shown in Python) here:
>>
>> https://en.wikipedia.org/wiki/Cycle_detection
>
> It's not necessarily a cycle though. Consider a sequence of function calls:
>
> def f(x):
> return g(x)
>
> def g(x):
> if x < 7:
> return h(x)
> elif x < 50:
> return g(x//2)
> else:
> return x+f(x-1)
>
> def h(x):
> raise ValueError  # oops, a bug
>
>
> and a function call f(54). That will result in the chain of calls:
>
> f(54) -> g(54) -> f(53) -> g(53) -> f(52) -> g(52) -> f(51) -> g(51) ->
> f(50) -> g(50) -> g(25) -> g(12) -> g(6) -> h(6) raises
>
> So you have that almost-cycle f calls g calls f, but it isn't periodic
> because you break out of it. I'd still like to detect the repeated f->g
> calls.

It doesn't matter that you break out of it. It's periodic for some
part and the algorithms listed on that page can find the cycle. In the
recursive stack overflow case what you'll usually have is

1) A few frames leading up to the start of recursion
2) A long repetitive sequence of frames
3) A few frames at the end showing how the exception was ultimately triggered.

You just need to find the cycle that makes that big long sequence.

--
Oscar
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Steven D'Aprano
On Thu, 21 Apr 2016 06:53 pm, Oscar Benjamin wrote:

> On 21 April 2016 at 04:07, Steven D'Aprano  wrote:
>> I want to group repeated items in a sequence. For example, I can group
>> repeated sequences of a single item at a time using groupby:
>>
>>
>> from itertools import groupby
>> for key, group in groupby("BBCDDEEE"):
>> group = list(group)
>> print(key, "count =", len(group))
>>
>>
>> outputs:
>>
>> A count = 4
>> B count = 2
>> C count = 1
>> D count = 2
>> E count = 3
>> F count = 4
>>
>>
>> Now I want to group subsequences. For example, I have:
>>
>> "ABCABCABCDEABCDEFABCABCABCB"
>>
>> and I want to group it into repeating subsequences. I can see two ways to
>> group it:
>>
>> ABC ABC ABCDE ABCDE F ABC ABC ABC B
> 
> There are some algorithms (helpfully shown in Python) here:
> 
> https://en.wikipedia.org/wiki/Cycle_detection

It's not necessarily a cycle though. Consider a sequence of function calls:

def f(x):
return g(x)

def g(x):
if x < 7:
return h(x)
elif x < 50:
return g(x//2)
else:
return x+f(x-1)

def h(x):
raise ValueError  # oops, a bug


and a function call f(54). That will result in the chain of calls:

f(54) -> g(54) -> f(53) -> g(53) -> f(52) -> g(52) -> f(51) -> g(51) ->
f(50) -> g(50) -> g(25) -> g(12) -> g(6) -> h(6) raises

So you have that almost-cycle f calls g calls f, but it isn't periodic
because you break out of it. I'd still like to detect the repeated f->g
calls.




-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Nobody
On Thu, 21 Apr 2016 18:05:40 +1000, Steven D'Aprano wrote:

> The specific problem I am trying to solve is that I have a sequence of 
> strings (in this case, error messages from a Python traceback) and I'm 
> looking for repeated groups that may indicate mutually recursive calls. E.g. 
> suppose I have a function f which calls g, and g calls h, and h calls f 
> again, and there's an exception, you will see a traceback in part:

This is a specific case of finding cycles in a directed graph. But
treating it as such probably isn't useful here, because you're interested
in a specific traversal of that graph rather than the graph itself.

One way to approach it is:

sofar = []
for line in traceback:
if line in sofar:
j = sofar.index(line)
if sofar[:j] == sofar[j:j*2]:
# found repeat
sofar = [line] + sofar

Note that sofar needs to be in reverse order, because list doesn't have
.rindex() or .rfind().

Detecting nested cycles is somewhat harder because given e.g.

ababxabababababxababab

you'd want the five repeats of ab in the middle to be treated as two
repeats of ab followed by three repeats of ab, but there's no way to
spot that until later.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Serhiy Storchaka

On 21.04.16 06:07, Steven D'Aprano wrote:

Now I want to group subsequences. For example, I have:

"ABCABCABCDEABCDEFABCABCABCB"

and I want to group it into repeating subsequences.


[...]


How can I do this? Does this problem have a standard name and/or solution?


This is a part of lossless data compression algorithms. See for example 
LZ, LZW.


--
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Oscar Benjamin
On 21 April 2016 at 04:07, Steven D'Aprano  wrote:
> I want to group repeated items in a sequence. For example, I can group
> repeated sequences of a single item at a time using groupby:
>
>
> from itertools import groupby
> for key, group in groupby("BBCDDEEE"):
> group = list(group)
> print(key, "count =", len(group))
>
>
> outputs:
>
> A count = 4
> B count = 2
> C count = 1
> D count = 2
> E count = 3
> F count = 4
>
>
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B

There are some algorithms (helpfully shown in Python) here:

https://en.wikipedia.org/wiki/Cycle_detection

Note that those are for a sequence made as x[n+1] = f(x[n]) for some
function f. In your case that's just the function that gets the next
frame up/down in the call stack.

--
Oscar

--
Oscar
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Steven D'Aprano
On Thursday 21 April 2016 16:35, Michael Selik wrote:

> On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano 
> wrote:
> 
>> I want to group [repeated] subsequences. For example, I have:
>> "ABCABCABCDEABCDEFABCABCABCB"
>> and I want to group it into repeating subsequences. I can see two
>> ways... How can I do this? Does this problem have a standard name and/or
>> solution?
>>
> 
> I'm not aware of a standard name. This sounds like an unsupervised
> learning problem. There's no objectively correct answer unless you add
> more specificity to the problem statement.
> 
> Regexes may sound tempting at first, 

Ah, I may have mislead you all. I cannot use regexes, since the *actual* 
sequences I'm working on are sequences (lists, tuples, etc) or iterators of 
arbitrary items. The items themselves may be strings, but the sequence 
itself is definitely not a string. I just showed a string for convenience. 
Sorry about that.

So no regexes.


> but because a repeating subsequence
> may have nested repeating subsequences and this can go on infinitely, I
> think we at least need a push-down automata.

Fortunately, for my *immediate* problem, I would be good with some fairly 
arbitrary restrictions on subsequence detection.


> I checked out some links for clustering algorithms that work on series
> subsequences and I found some fun results.
> 
> Clustering is meaningless!
> http://www.cs.ucr.edu/~eamonn/meaningless.pdf
> 
> I think you're in "no free lunch" territory. "Clustering of subsequence
> time series remains an open issue in time series clustering"
> http://www.hindawi.com/journals/tswj/2014/312521/
> 
> Any more detail on the problem to add constraints?

The specific problem I am trying to solve is that I have a sequence of 
strings (in this case, error messages from a Python traceback) and I'm 
looking for repeated groups that may indicate mutually recursive calls. E.g. 
suppose I have a function f which calls g, and g calls h, and h calls f 
again, and there's an exception, you will see a traceback in part:


  File "", line 2, in f
  File "", line 5, in g
  File "", line 9, in h
  File "", line 2, in f
  File "", line 5, in g
  File "", line 9, in h
  File "", line 2, in f
  File "", line 5, in g
  File "", line 9, in h
  File "", line 7, in f
  File "", line 5, in g
  File "", line 9, in h

etc. Note that I only care about lines which are identical, e.g. if the line 
numbers differ (as in the last call to f), they will be treated as distinct 
items. So I'd like to group the above as:

  File "", line 2, in f
  File "", line 5, in g
  File "", line 9, in h
  *** above 3 calls repeated 3 times ***
  File "", line 7, in f
  File "", line 5, in g
  File "", line 9, in h



-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Alain Ketterlin
Steven D'Aprano  writes:

> I want to group repeated items in a sequence. For example, I can group
> repeated sequences of a single item at a time using groupby:
[...]
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
[...]
> or:
>
> ABC ABC ABC D E A B C D E F ABC ABC ABC B
[...]
> How can I do this? Does this problem have a standard name and/or solution?

Looks like a tough one. I don't think it has a name (I'm not even sure
to be able to formally define it). Lets say it is an instance of
"longest repeating substring"
(https://en.wikipedia.org/wiki/Longest_repeated_substring_problem --
which really does not say much).

Anyway, it looks like a job for a suffix trees.

Depending on what you are after, you may also be interested in the
sequitur algorithm (http://www.sequitur.info/).

-- Alain.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-21 Thread Michael Selik
On Thu, Apr 21, 2016 at 2:55 AM Vlastimil Brom 
wrote:

> 2016-04-21 5:07 GMT+02:00 Steven D'Aprano :
> > I want to group subsequences.
> > "ABCABCABCDEABCDEFABCABCABCB"
> > ABC ABC ABCDE ABCDE F ABC ABC ABC B
> > or:
> > ABC ABC ABC D E A B C D E F ABC ABC ABC B
>
> if I am not missing something, the latter form of grouping might be
> achieved with the following regex: [snip]
> The former one seems to be more tricky...
>

Right. If the problem is constrained to say that repeated subsequences can
have no nested repeated subsequences, it's much easier to solve.

If you had "ABCABCABCABC" should that result in
ABC ABC ABC ABC, with 4 repetitions
or ABCABC ABCABC with 2 repetitions?
In this example, one might say the higher count is obviously better, but I
think it depends on the context. Maybe the user is looking for the biggest
patterns rather than the biggest counts.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-20 Thread Vlastimil Brom
2016-04-21 5:07 GMT+02:00 Steven D'Aprano :
> I want to group repeated items in a sequence. For example, I can group
> repeated sequences of a single item at a time using groupby:
>
>
> from itertools import groupby
> for key, group in groupby("BBCDDEEE"):
> group = list(group)
> print(key, "count =", len(group))
>
>
> outputs:
>
> A count = 4
> B count = 2
> C count = 1
> D count = 2
> E count = 3
> F count = 4
>
>
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>
> giving counts:
>
> (ABC) count = 2
> (ABCDE) count = 2
> F count = 1
> (ABC) count = 3
> B repeats 1 time
>
>
> or:
>
> ABC ABC ABC D E A B C D E F ABC ABC ABC B
>
> giving counts:
>
> (ABC) count = 3
> D count = 1
> E count = 1
> A count = 1
> B count = 1
> C count = 1
> D count = 1
> E count = 1
> F count = 1
> (ABC) count = 3
> B count = 1
>
>
>
> How can I do this? Does this problem have a standard name and/or solution?
>
>
>
>
> --
> Steven
>
> --
> https://mail.python.org/mailman/listinfo/python-list

Hi,
if I am not missing something, the latter form of grouping might be
achieved with the following regex:

t="ABCABCABCDEABCDEFABCABCABCB"
grouped = re.findall(r"((?:(\w+?)\2+)|\w+?)", t)
print(grouped)
for grp, subseq in grouped:
if subseq:
print(subseq, grp.count(subseq))
else:
print(grp, "1")


the printed output is:

[('ABCABCABC', 'ABC'), ('D', ''), ('E', ''), ('A', ''), ('B', ''),
('C', ''), ('D', ''), ('E', ''), ('F', ''), ('ABCABCABC', 'ABC'),
('B', '')]
ABC 3
D 1
E 1
A 1
B 1
C 1
D 1
E 1
F 1
ABC 3
B 1

The former one seems to be more tricky...

hth,
   vbr
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-20 Thread Michael Selik
On Thu, Apr 21, 2016 at 2:35 AM Michael Selik 
wrote:

> On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano 
> wrote:
>
>> I want to group [repeated] subsequences. For example, I have:
>> "ABCABCABCDEABCDEFABCABCABCB"
>> and I want to group it into repeating subsequences. I can see two
>> ways... How can I do this? Does this problem have a standard name and/or
>> solution?
>>
>
> I'm not aware of a standard name. This sounds like an unsupervised
> learning problem. There's no objectively correct answer unless you add more
> specificity to the problem statement.
>
> Regexes may sound tempting at first, but because a repeating subsequence
> may have nested repeating subsequences and this can go on infinitely, I
> think we at least need a push-down automata.
>
> I checked out some links for clustering algorithms that work on series
> subsequences and I found some fun results.
>
> Clustering is meaningless!
> http://www.cs.ucr.edu/~eamonn/meaningless.pdf
>
> I think you're in "no free lunch" territory. "Clustering of subsequence
> time series remains an open issue in time series clustering"
> http://www.hindawi.com/journals/tswj/2014/312521/
>
> Any more detail on the problem to add constraints?
>

Some light reading suggests that you can improve your problem by defining a
minimum size for a subsequence to qualify. One paper suggests calling these
more interesting repetitions a "motif" to use a music metaphor. Looking for
any repetitions results in too many trivial results. Is that valid for your
usage?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-20 Thread Michael Selik
On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano 
wrote:

> I want to group [repeated] subsequences. For example, I have:
> "ABCABCABCDEABCDEFABCABCABCB"
> and I want to group it into repeating subsequences. I can see two
> ways... How can I do this? Does this problem have a standard name and/or
> solution?
>

I'm not aware of a standard name. This sounds like an unsupervised learning
problem. There's no objectively correct answer unless you add more
specificity to the problem statement.

Regexes may sound tempting at first, but because a repeating subsequence
may have nested repeating subsequences and this can go on infinitely, I
think we at least need a push-down automata.

I checked out some links for clustering algorithms that work on series
subsequences and I found some fun results.

Clustering is meaningless!
http://www.cs.ucr.edu/~eamonn/meaningless.pdf

I think you're in "no free lunch" territory. "Clustering of subsequence
time series remains an open issue in time series clustering"
http://www.hindawi.com/journals/tswj/2014/312521/

Any more detail on the problem to add constraints?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-20 Thread Chris Angelico
On Thu, Apr 21, 2016 at 1:07 PM, Steven D'Aprano  wrote:
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>
> or:
>
> ABC ABC ABC D E A B C D E F ABC ABC ABC B

Interesting. I've *almost* managed to (ab)use re.split for this
purpose. A one-step solution can be done with re.match:

>>> txt = "ABCABCABCDEABCDEFABCABCABCB"
>>> re.match(r'(.+)\1+', txt)
<_sre.SRE_Match object; span=(0, 9), match='ABCABCABC'>

But split then returns only the grouped part:

>>> re.split(r'(.+)\1+', txt)
['', 'ABC', 'DEABCDEF', 'ABC', 'B']

or *all* the grouped parts:

>>> re.split(r'((.+)\2+)', txt)
['', 'ABCABCABC', 'ABC', 'DEABCDEF', 'ABCABCABC', 'ABC', 'B']

There's definitely a partial solution happening here, but I can't
quite make it work.

And no, I don't know if there's a standard name for it.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-20 Thread Ethan Furman

On 04/20/2016 08:57 PM, Ethan Furman wrote:

> [snip same pattern as Steven wrote]

Nevermind.  It's obviously time for me to go to bed.  :/

--
~Ethan~
--
https://mail.python.org/mailman/listinfo/python-list


Re: Detecting repeated subsequences of identical items

2016-04-20 Thread Ethan Furman

On 04/20/2016 08:07 PM, Steven D'Aprano wrote:


Now I want to group subsequences. For example, I have:

"ABCABCABCDEABCDEFABCABCABCB"

and I want to group it into repeating subsequences. I can see two ways to
group it:

ABC ABC ABCDE ABCDE F ABC ABC ABC B

giving counts:

(ABC) count = 2
(ABCDE) count = 2
F count = 1
(ABC) count = 3
B repeats 1 time


or

ABC ABC ABC D E A B C D E F ABC ABC B

--
~Ethan~
--
https://mail.python.org/mailman/listinfo/python-list