[ 
https://issues.apache.org/jira/browse/LUCENE-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Poppe updated LUCENE-8102:
---------------------------------
    Description: 
We're using the automaton package as part of Elasticsearch for doing regexp 
queries.  Our business requires us to process rather complex regular 
expressions, for example (we have more complex examples, but this one 
illustrates the problem):

{noformat}
        (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d
{noformat}

With a large enough value of maxDeterminizedStates, this works.  The problem 
we're having is that the conversion of this regular expression to a 
CompiledAutomaton takes very long.  Almost all of the time goes into 
determining the common suffix for the Automaton (which is "d" in this example) 
- calculated with a call to Operations.getCommonSuffixBytesRef.

This suffix is only used as an optimization.  Skipping the calculation of this 
suffix allows us to process these kinds of queries.

- Would it be possible to introduce a way to skip the calculation of this 
common suffix (ideally something we control from within our query to 
Elasticsearch)?
- Or would it be possible to take a look at this getCommonSuffixBytesRef 
operation, to see if it can be optimized?  Most of the time goes to 
determinizing the reversed automaton - maybe this can be avoided somehow?


Reaction from Mike McCandless on the mailing list:
This is just an optimization; maybe we should expose an option to disable it?
Or maybe we can find the common suffix on an NFA instead, to avoid 
determinization?

  was:
We're using the automaton package as part of Elasticsearch for doing regexp 
queries.  Our business requires us to process rather complex regular 
expressions, for example (we have more complex examples, but this one 
illustrates the problem):

        (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d

With a large enough value of maxDeterminizedStates, this works.  The problem 
we're having is that the conversion of this regular expression to a 
CompiledAutomaton takes very long.  Almost all of the time goes into 
determining the common suffix for the Automaton (which is "d" in this example) 
- calculated with a call to Operations.getCommonSuffixBytesRef.

This suffix is only used as an optimization.  Skipping the calculation of this 
suffix allows us to process these kinds of queries.

- Would it be possible to introduce a way to skip the calculation of this 
common suffix (ideally something we control from within our query to 
Elasticsearch)?
- Or would it be possible to take a look at this getCommonSuffixBytesRef 
operation, to see if it can be optimized?  Most of the time goes to 
determinizing the reversed automaton - maybe this can be avoided somehow?


Reaction from Mike McCandless on the mailing list:
This is just an optimization; maybe we should expose an option to disable it?
Or maybe we can find the common suffix on an NFA instead, to avoid 
determinization?


> CompiledAutomaton performance for determining common suffix
> -----------------------------------------------------------
>
>                 Key: LUCENE-8102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8102
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 7.1
>            Reporter: Thomas Poppe
>
> We're using the automaton package as part of Elasticsearch for doing regexp 
> queries.  Our business requires us to process rather complex regular 
> expressions, for example (we have more complex examples, but this one 
> illustrates the problem):
> {noformat}
>         (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d
> {noformat}
> With a large enough value of maxDeterminizedStates, this works.  The problem 
> we're having is that the conversion of this regular expression to a 
> CompiledAutomaton takes very long.  Almost all of the time goes into 
> determining the common suffix for the Automaton (which is "d" in this 
> example) - calculated with a call to Operations.getCommonSuffixBytesRef.
> This suffix is only used as an optimization.  Skipping the calculation of 
> this suffix allows us to process these kinds of queries.
> - Would it be possible to introduce a way to skip the calculation of this 
> common suffix (ideally something we control from within our query to 
> Elasticsearch)?
> - Or would it be possible to take a look at this getCommonSuffixBytesRef 
> operation, to see if it can be optimized?  Most of the time goes to 
> determinizing the reversed automaton - maybe this can be avoided somehow?
> Reaction from Mike McCandless on the mailing list:
> This is just an optimization; maybe we should expose an option to disable it?
> Or maybe we can find the common suffix on an NFA instead, to avoid 
> determinization?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to