[
https://issues.apache.org/jira/browse/LUCENE-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295003#comment-16295003
]
Robert Muir commented on LUCENE-8102:
-------------------------------------
The optimization speeds up leading wildcard type-queries (such as wildcard
"*foo" or regex ".*foo") which are far more common e.g. on the users list than
the example presented on this issue, so I really think we should keep it.
The regex here is quite obscene, maybe its the thing that should be optimized?
:)
> CompiledAutomaton performance for determining common suffix
> -----------------------------------------------------------
>
> Key: LUCENE-8102
> URL: https://issues.apache.org/jira/browse/LUCENE-8102
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Affects Versions: 7.1
> Reporter: Thomas Poppe
>
> We're using the automaton package as part of Elasticsearch for doing regexp
> queries. Our business requires us to process rather complex regular
> expressions, for example (we have more complex examples, but this one
> illustrates the problem):
> {noformat}
> (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d
> {noformat}
> With a large enough value of maxDeterminizedStates, this works. The problem
> we're having is that the conversion of this regular expression to a
> CompiledAutomaton takes very long. Almost all of the time goes into
> determining the common suffix for the Automaton (which is "d" in this
> example) - calculated with a call to Operations.getCommonSuffixBytesRef.
> This suffix is only used as an optimization. Skipping the calculation of
> this suffix allows us to process these kinds of queries.
> - Would it be possible to introduce a way to skip the calculation of this
> common suffix (ideally something we control from within our query to
> Elasticsearch)?
> - Or would it be possible to take a look at this getCommonSuffixBytesRef
> operation, to see if it can be optimized? Most of the time goes to
> determinizing the reversed automaton - maybe this can be avoided somehow?
> Reaction from Mike McCandless on the mailing list:
> This is just an optimization; maybe we should expose an option to disable it?
> Or maybe we can find the common suffix on an NFA instead, to avoid
> determinization?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]