[ 
https://issues.apache.org/jira/browse/LUCENE-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456106#comment-17456106
 ] 

ASF subversion and git services commented on LUCENE-10296:
----------------------------------------------------------

Commit 7a872c7a5c00d846314d44a445f8b0e83acb6a86 in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=7a872c7 ]

LUCENE-10296: Stop minimizing regepx (#528)

In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the 
expression. The parser nor the low-level executors don't implicitly call 
exponential-time algorithms anymore.

But now that we have cleaned this up, we can see it is even worse than just 
calling determinize(). We still call minimize() which is much crazier and much 
more.

We stopped doing this for all other AutomatonQuery subclasses a long time ago, 
as we determined that it didn't help performance. Additionally, minimization 
vs. determinization is even less important than early days where we found 
trouble: the representation got a lot better. Today when you finishState we do 
a lot of practical sorting/coalescing on-the-fly. Also we added this fancy 
UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state 
significantly lower than it was before? So why minimize() ?

Let's just replace minimize() calls with determinize() calls? I've already 
swapped them out for all of src/test, to get jenkins looking for issues ahead 
of time.

This change moves hopcroft minimization (MinimizeOperations) to src/test for 
now. I'd like to explore nuking it from there as a next step, any tests that 
truly need minimization should be fine with brzozowski's
algorithm.

> Stop minimizing regexps
> -----------------------
>
>                 Key: LUCENE-10296
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10296
>             Project: Lucene - Core
>          Issue Type: Task
>    Affects Versions: 10.0 (main)
>            Reporter: Robert Muir
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the 
> expression. The parser nor the low-level executors don't implicitly call 
> exponential-time algorithms anymore.
> But now that we have cleaned this up, we can see it is even worse than just 
> calling {{{}determinize(){}}}. We still call {{minimize()}} which is much 
> crazier and much more. 
> We stopped doing this for all other AutomatonQuery subclasses a long time 
> ago, as we determined that it didn't help performance. Additionally, 
> minimization vs. determinization is even less important than early days where 
> we found trouble: the representation got a lot better. Today when you 
> {{finishState}} we do a lot of practical sorting/coalescing on-the-fly. Also 
> we added this fancy UTF32-to-UTF8 automata convertor, that makes the 
> worst-case-space-per-state significantly lower than it was before? So why 
> {{minimize()}} ?
> Let's just replace {{minimize()}} calls with {{determinize()}} calls? I've 
> already swapped them out for all of {{{}src/test{}}}, to get jenkins looking 
> for issues ahead of time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to