[
https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546776#comment-15546776
]
Michael McCandless commented on LUCENE-7465:
--------------------------------------------
Thank you for the example [~dweiss]. Indeed that's a hard regexp to
determinize. It's interesting because the determinization requires many
states, yet it minimizes to an apparently contained number of states (though
many transitions).
E.g. at 30 clauses, the determinized form had 7652 states and 136898
transitions, but after minimization that drops to 150 states and 2960 transitions.
I tried to run {{dot}} on this FSA but it struggles :)
Net/net the DFA approach is not usable in some cases (like this one); such
users must use the JDK implementation. Maybe we should explore an {{re2j}}
version too.
bq. Btw. if you're looking into this again, piggyback a change to
Operations.determinize and replace LinkedList with an ArrayDeque, it certainly
won't hurt.
Excellent, I'll fold that in!
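
For readers following along, the worklist pattern behind that suggestion can be sketched in plain Java. This is not Lucene's {{Operations.determinize}}: the class name, the method name and the toy NFA (strings over a/b whose n-th character from the end is 'a') are all assumptions made for illustration. It runs a subset construction with an {{ArrayDeque}} as the queue of unexplored state sets and counts the DFA states it discovers.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a toy subset construction, not Lucene's implementation.
class SubsetConstructionSketch {

  /**
   * Toy NFA over {a, b} with states 0..n:
   *   state 0 loops on both symbols and additionally moves to 1 on 'a';
   *   state i (1 <= i < n) moves to i + 1 on either symbol;
   *   state n is accepting and has no outgoing transitions.
   * Returns the number of DFA states reachable by subset construction.
   */
  static int countDeterminizedStates(int n) {
    Set<BitSet> seen = new HashSet<>();
    // ArrayDeque instead of LinkedList: same FIFO behaviour, backed by an array.
    Deque<BitSet> worklist = new ArrayDeque<>();

    BitSet start = new BitSet(n + 1);
    start.set(0);
    seen.add(start);
    worklist.add(start);

    while (!worklist.isEmpty()) {
      BitSet current = worklist.removeFirst();
      for (char c = 'a'; c <= 'b'; c++) {
        // Compute the set of NFA states reachable from 'current' on symbol c.
        BitSet next = new BitSet(n + 1);
        for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
          if (s == 0) {
            next.set(0);
            if (c == 'a') {
              next.set(1);
            }
          } else if (s < n) {
            next.set(s + 1);
          }
        }
        // Each distinct set of NFA states becomes one DFA state.
        if (seen.add(next)) {
          worklist.add(next);
        }
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    for (int n = 1; n <= 12; n++) {
      System.out.println(n + " -> " + countDeterminizedStates(n) + " DFA states");
    }
  }
}
```

The NFA has only n + 1 states while the state count after determinization doubles with each increment of n. Unlike the regexp discussed above, this toy language does not shrink under minimization, so it only illustrates the blow-up half of the observation.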
> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
> Key: LUCENE-7465
> URL: https://issues.apache.org/jira/browse/LUCENE-7465
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0), 6.3
>
> Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that
> uses Lucene's RegExp impl instead of the JDK's:
> * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp
> is attempted the user discovers it up front instead of later on when a
> "lucky" document arrives
> * It processes the incoming characters as a stream, only pulling 128
> characters at a time, vs the existing {{PatternTokenizer}} which currently
> reads the entire string up front (this has caused heap problems in the past)
> * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and
> improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps
> don't yet implement sub group capture. I think we could add that at some
> point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we
> did that we should maybe name it differently
> ({{SimplePatternSplitTokenizer}}?).
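
The streaming behaviour described in the issue can be sketched in plain Java, under heavy assumptions: the {{Dfa}} interface, the class name and the hand-written word automaton below are hypothetical and are not taken from the attached patch or from Lucene's API. The sketch pulls at most 128 characters per read and emits the longest DFA match at each position, skipping characters where no match starts.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the SimplePatternTokenizer from the patch.
class ChunkedDfaTokenizerSketch {

  /** Hypothetical minimal DFA interface; state 0 is the start state. */
  interface Dfa {
    int step(int state, char c); // next state, or -1 if there is no transition
    boolean isAccept(int state);
  }

  /** Hand-written DFA for one or more lowercase ASCII letters. */
  static final Dfa LOWERCASE_WORDS = new Dfa() {
    @Override public int step(int state, char c) {
      return (c >= 'a' && c <= 'z') ? 1 : -1;
    }
    @Override public boolean isAccept(int state) {
      return state == 1;
    }
  };

  static List<String> tokenize(Reader in, Dfa dfa) {
    List<String> tokens = new ArrayList<>();
    char[] buffer = new char[128];               // read at most 128 chars at a time
    StringBuilder pending = new StringBuilder(); // chars not yet emitted or skipped
    int pos = 0;          // next char of 'pending' to feed to the DFA
    int state = 0;
    int lastAccept = -1;  // length of the longest accepted prefix of 'pending'
    boolean eof = false;
    try {
      while (true) {
        if (pos < pending.length()) {
          int next = dfa.step(state, pending.charAt(pos));
          if (next != -1) {
            state = next;
            pos++;
            if (dfa.isAccept(state)) {
              lastAccept = pos;
            }
            continue;
          }
          // Dead end: fall through and resolve the current attempt.
        } else if (!eof) {
          int n = in.read(buffer, 0, buffer.length);
          if (n == -1) {
            eof = true;
          } else {
            pending.append(buffer, 0, n);
          }
          continue;
        } else if (pending.length() == 0) {
          break; // input exhausted and nothing left to resolve
        }
        if (lastAccept > 0) {
          tokens.add(pending.substring(0, lastAccept)); // emit longest match
          pending.delete(0, lastAccept);
        } else {
          pending.deleteCharAt(0); // no match starts here; skip one char
        }
        pos = 0;
        state = 0;
        lastAccept = -1;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize(new StringReader("foo bar  baz"), LOWERCASE_WORDS));
  }
}
```

Only the characters of the match currently being attempted are held in memory, so a long document does not have to be read up front; a single very long match still grows the pending buffer, and this naive version restarts the DFA after each skipped character.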