[jira] [Comment Edited] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-29 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353908#comment-17353908
 ] 

Robert Muir edited comment on LUCENE-9981 at 5/30/21, 12:10 AM:


cc: [~dweiss] in case you have thoughts. Quick Summary if you are curious:

The problem breaks down to 2 bugs as Mike explains:
 * the optimization to compute the "common suffix" of an automaton was slow, as it 
required {{commonPrefix(determinize(reverse(input)))}}. The problem of course 
is the determinize (NFA->DFA), which has that 2^N worst case: this is an 
optimization intended for stupid cases like a leading wildcard, to blast through 
them quickly with a memcmp before doing any automata work.
 * separately, the limit params on NFA->DFA conversion were ineffective, as 
they are based on the number of states, not the amount of work done.
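To illustrate the second point, here is a minimal, self-contained sketch of subset construction (NFA -> DFA) whose limit measures total work done rather than the number of states produced. This is an illustration of the idea only, not Lucene's actual {{Operations.determinize()}}; all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Subset construction that bounds CPU effort, not just the DFA state count. */
final class BudgetedDeterminize {

    /** Thrown when determinization exceeds its work budget (hypothetical name). */
    static final class TooMuchWork extends RuntimeException {
        TooMuchWork(long effort) {
            super("determinize effort exceeded: " + effort);
        }
    }

    /**
     * nfa.get(state).get(symbol) is the set of next NFA states.
     * Returns the number of DFA states built.
     */
    static int determinize(Map<Integer, Map<Character, Set<Integer>>> nfa,
                           Set<Integer> start, long maxEffort) {
        long effort = 0;
        Map<Set<Integer>, Integer> dfaStates = new HashMap<>();
        Deque<Set<Integer>> pending = new ArrayDeque<>();
        dfaStates.put(start, 0);
        pending.add(start);
        while (!pending.isEmpty()) {
            Set<Integer> current = pending.remove();
            // group all reachable NFA states by input symbol
            Map<Character, Set<Integer>> bySymbol = new HashMap<>();
            for (int s : current) {
                for (Map.Entry<Character, Set<Integer>> e
                        : nfa.getOrDefault(s, Map.of()).entrySet()) {
                    // charge for every transition touched: this tracks real work,
                    // which a pure state-count limit would not see
                    effort += e.getValue().size();
                    if (effort > maxEffort) {
                        throw new TooMuchWork(effort);
                    }
                    bySymbol.computeIfAbsent(e.getKey(), k -> new HashSet<>())
                            .addAll(e.getValue());
                }
            }
            for (Set<Integer> next : bySymbol.values()) {
                if (dfaStates.putIfAbsent(next, dfaStates.size()) == null) {
                    pending.add(next);
                }
            }
        }
        return dfaStates.size();
    }
}
```

Run on the tiny NFA for {{.*a}} (state 0 loops on both symbols, with 'a' also reaching state 1), a generous budget yields 2 DFA states, while a tiny budget throws before any work can pile up.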

Changes:
 * fix commonPrefix to work on NFAs as well as DFAs, so we don't have to 
reverse the automaton and then determinize again, which is expensive.
 * separately, fix the limit input to determinize() to better represent the 
amount of CPU work done: previously the limit was based on the number of 
states, but cases such as this one can spin the CPU without ever tripping 
that limit.
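The first change can be sketched like this: walk the NFA frontier directly, extending the prefix only while every outgoing transition agrees on one label, so no reverse() or determinize() step is needed. This is a simplified illustration with hypothetical names, using plain chars rather than Lucene's UTF-8 transition ranges:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Longest common prefix of an NFA's language, computed without determinizing. */
final class NfaCommonPrefix {

    /**
     * nfa.get(state).get(symbol) is the set of next states; accept holds the
     * accepting states. maxLen bounds the walk so cycles cannot spin forever.
     */
    static String commonPrefix(Map<Integer, Map<Character, Set<Integer>>> nfa,
                               Set<Integer> start, Set<Integer> accept,
                               int maxLen) {
        StringBuilder prefix = new StringBuilder();
        Set<Integer> frontier = new HashSet<>(start);
        while (prefix.length() < maxLen) {
            // an accepting state means some accepted string ends right here,
            // so the common prefix cannot grow any further
            for (int s : frontier) {
                if (accept.contains(s)) {
                    return prefix.toString();
                }
            }
            // the prefix extends only if every transition shares a single label
            Character common = null;
            Set<Integer> next = new HashSet<>();
            for (int s : frontier) {
                for (Map.Entry<Character, Set<Integer>> e
                        : nfa.getOrDefault(s, Map.of()).entrySet()) {
                    if (common == null) {
                        common = e.getKey();
                    } else if (!common.equals(e.getKey())) {
                        return prefix.toString();
                    }
                    next.addAll(e.getValue());
                }
            }
            if (common == null) {
                return prefix.toString(); // dead end: nothing more to extend
            }
            prefix.append(common.charValue());
            frontier = next;
        }
        return prefix.toString();
    }
}
```

For example, an NFA whose accepted strings all start with "fo" (even with a nondeterministic branch after 'f') yields "fo" directly, with no determinization.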

edit: fix crazy Jira formatting, sorry. and edit 2, again.



> CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with 
> default maxDeterminizedStates limit
> 
>
> Key: LUCENE-9981
> URL: https://issues.apache.org/jira/browse/LUCENE-9981
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9981.patch, LUCENE-9981.patch, LUCENE-9981.patch, 
> LUCENE-9981_nfaprefix.patch, LUCENE-9981_test.patch, 
> three-repeats-reverse-det.png, three-repeats.png
>
>
> We have a {{maxDeterminizedStates = 1}} limit designed to keep 
> regexp-type queries from blowing up. 
> But we have an adversary that will run for 268s on my laptop before hitting the 
> exception, first reported here: 
> https://github.com/opensearch-project/OpenSearch/issues/687
> When I run the test and jstack the threads, this is what I see:
> {noformat}
> "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 
> os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 
> runnable  [0x7fff8b7f]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
>   at 
> org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
>   at 
> org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
>   at 
> org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
>   at 
> org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104)
>   at 
> org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62)
>   at 
> org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
> {noformat}
> This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be 
> an "up-front" optimization to make the actual subsequent terms-intensive part 
> of the query faster. But it makes the whole query run for nearly 5 minutes 
> before it does anything.
> So I definitely think we should improve {{getCommonSuffixBytesRef}} to be

[jira] [Comment Edited] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-29 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353765#comment-17353765
 ] 

Michael McCandless edited comment on LUCENE-9981 at 5/29/21, 2:19 PM:
--

What a delightful {{RegExp}}! To be able to actually visualize it, I reduced 
the repeat count to three:
{noformat}
(.*a){3}
{noformat}

Resulting in this fun automaton (determinized, and converted to operate in 
UTF-8 byte space):

!three-repeats.png!



> CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with 
> default maxDeterminizedStates limit
> 
>
> Key: LUCENE-9981
> URL: https://issues.apache.org/jira/browse/LUCENE-9981
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9981_test.patch, three-repeats.png
>
>
> We have a {{maxDeterminizedStates = 1}} limit designed to keep 
> regexp-type queries from blowing up. 
> But we have an adversary that will run for 268s on my laptop before hitting the 
> exception, first reported here: 
> https://github.com/opensearch-project/OpenSearch/issues/687
> When I run the test and jstack the threads, this is what I see:
> {noformat}
> "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 
> os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 
> runnable  [0x7fff8b7f]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
>   at 
> org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
>   at 
> org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
>   at 
> org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
>   at 
> org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104)
>   at 
> org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62)
>   at 
> org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
> {noformat}
> This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be 
> an "up-front" optimization to make the actual subsequent terms-intensive part 
> of the query faster. But it makes the whole query run for nearly 5 minutes 
> before it does anything.
> So I definitely think we should improve {{getCommonSuffixBytesRef}} to be 
> more "best-effort". For example, it can reduce the lower bound to {{1000}} 
> and catch the exception like such:
> {code}
> try {
>// this is slow, and just an opto anyway, so don't burn cycles on it for 
> some crazy worst-case.
>// if we don't set this common suffix, the query will just run a bit 
> slower, that's all.
>int limit = Math.min(1000, maxDeterminizedStates);
>BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
>... (setting commonSuffixRef)
> } catch (TooComplexToDeterminizeException notWorthIt) {
>   commonSuffixRef = null;
> }
> {code}
> Another, maybe simpler, option is to just check that the input state/transition 
> counts don't exceed some low limit N.
> Basically this opto is geared at stuff like a leading wildcard query of "*foo". 
> By computing that the common suffix is "foo", we can spend less CPU in the 
> terms dictionary, because we can first do a memcmp before having to run data 
> through any finite state machine. It's really a micro-opt and we shouldn't be 
> spending whole seconds of CPU on it, ever.
> But I still don't quite understand how the current limits are giving the 
> behavior today, maybe there is a bigger issue and I don't want to shove 
> something under the rug.
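The micro-opt described in the quoted report amounts to a cheap byte comparison ahead of any automaton work. A hedged sketch (names are illustrative, not Lucene's actual API): only terms that survive this check need to be run through the finite state machine.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Cheap suffix pre-filter run before the expensive automaton (illustrative). */
final class SuffixPrefilter {

    private final byte[] suffix;

    SuffixPrefilter(String commonSuffix) {
        this.suffix = commonSuffix.getBytes(StandardCharsets.UTF_8);
    }

    /** memcmp-style check: any matching term must end with the common suffix. */
    boolean mightMatch(byte[] term) {
        if (term.length < suffix.length) {
            return false;
        }
        return Arrays.equals(term, term.length - suffix.length, term.length,
                             suffix, 0, suffix.length);
    }
}
```

For a leading-wildcard query like {{*foo}}, terms failing mightMatch can be skipped without ever touching the automaton.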



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
