[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546776#comment-15546776 ]

Michael McCandless commented on LUCENE-7465:
--------------------------------------------

Thank you for the example, [~dweiss].  Indeed that's a hard regexp to 
determinize.  It's interesting because determinizing it requires many 
states, yet the result minimizes to a fairly contained number of states 
(though with many transitions).

E.g. at 30 clauses, the determinized form produced 7652 states and 136898 
transitions, but after minimization that drops to 150 states and 2960 transitions.  
I tried to run {{dot}} on this FSA but it struggles :)
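
For reference, a minimal sketch of how counts like these can be gathered with the automaton API; the pattern here is just a placeholder for the hard 30-clause regexp, and the explicit state limit is an assumption:

{code:java}
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.MinimizationOperations;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

public class DeterminizeStats {
  public static void main(String[] args) {
    // Placeholder for the hard multi-clause pattern from the example.
    String pattern = "(foo1|foo2|foo3)";

    // Parse to an (in general non-deterministic) automaton.
    Automaton nfa = new RegExp(pattern).toAutomaton();

    // Determinize with a generous state limit (1_000_000 is an assumption).
    Automaton dfa = Operations.determinize(nfa, 1_000_000);
    System.out.println("determinized: " + dfa.getNumStates() + " states, "
        + totalTransitions(dfa) + " transitions");

    // Hopcroft minimization shrinks the state count dramatically.
    Automaton min = MinimizationOperations.minimize(dfa, 1_000_000);
    System.out.println("minimized:    " + min.getNumStates() + " states, "
        + totalTransitions(min) + " transitions");
  }

  // Sum per-state transition counts to get a total for the whole automaton.
  private static int totalTransitions(Automaton a) {
    int total = 0;
    for (int s = 0; s < a.getNumStates(); s++) {
      total += a.getNumTransitions(s);
    }
    return total;
  }
}
{code}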

Net/net the DFA approach is not usable in some cases (like this one); such 
users must use the JDK implementation.  Maybe we should explore an {{re2j}} 
version too.
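
To be concrete about that fallback path, here is a rough sketch (not from the patch) of detecting the too-hard case up front and dropping back to {{java.util.regex}}; the sample pattern is hypothetical:

{code:java}
import java.util.regex.Pattern;

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;
import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;

public class RegexpFallback {
  public static void main(String[] args) {
    String pattern = "(a|b)*c";  // hypothetical pattern
    try {
      // DFA path: the blow-up, if any, happens here at compile time,
      // not later when an unlucky document arrives.
      Automaton dfa = Operations.determinize(
          new RegExp(pattern).toAutomaton(),
          Operations.DEFAULT_MAX_DETERMINIZED_STATES);
      System.out.println("DFA states: " + dfa.getNumStates());
    } catch (TooComplexToDeterminizeException e) {
      // Too many states: fall back to the JDK's backtracking engine.
      Pattern jdkPattern = Pattern.compile(pattern);
      System.out.println("fell back to java.util.regex: " + jdkPattern.pattern());
    }
  }
}
{code}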

bq. Btw. if you're looking into this again, piggyback a change to 
Operations.determinize and replace LinkedList with an ArrayDeque, it certainly 
won't hurt.

Excellent, I'll fold that in!
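
For the record, the change is essentially the drop-in queue swap sketched below (generic types, not the actual {{Operations.determinize}} worklist code); {{ArrayDeque}} keeps the same FIFO semantics but avoids {{LinkedList}}'s per-node allocations:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

public class WorklistSwap {
  public static void main(String[] args) {
    // Before: Deque<int[]> worklist = new LinkedList<>();
    // After: same Deque operations, backed by a growable array instead of nodes.
    Deque<int[]> worklist = new ArrayDeque<>();
    worklist.add(new int[] {0});  // seed with the initial state set

    while (!worklist.isEmpty()) {
      int[] stateSet = worklist.removeFirst();  // FIFO order is preserved
      // ... determinize-style processing would push newly discovered
      // state sets back onto the worklist here ...
    }
  }
}
{code}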

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
>                 Key: LUCENE-7465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7465
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.3
>
>         Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that 
> uses Lucene's RegExp impl instead of the JDK's:
>   * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp 
> is attempted the user discovers it up front instead of later on when a 
> "lucky" document arrives
>   * It processes the incoming characters as a stream, only pulling 128 
> characters at a time, vs the existing {{PatternTokenizer}} which currently 
> reads the entire string up front (this has caused heap problems in the past)
>   * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and 
> improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps 
> don't yet implement sub group capture.  I think we could add that at some 
> point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we 
> did that we should maybe name it differently 
> ({{SimplePatternSplitTokenizer}}?).
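
Regarding the streaming point in the description above: the gist is that a {{CharacterRunAutomaton}} can be stepped one character at a time against a {{Reader}}, so only a small buffer is ever held in memory.  A rough sketch of that idea follows; the buffer size and restart logic are illustrative, not the patch's actual code:

{code:java}
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.util.automaton.CharacterRunAutomaton;
import org.apache.lucene.util.automaton.RegExp;

public class StreamingMatch {
  public static void main(String[] args) throws IOException {
    CharacterRunAutomaton dfa =
        new CharacterRunAutomaton(new RegExp("[a-z]+").toAutomaton());

    Reader reader = new StringReader("abc123def");
    char[] buffer = new char[128];  // pull only 128 chars at a time
    int state = 0;                  // state 0 is the start state
    int read;
    while ((read = reader.read(buffer)) != -1) {
      for (int i = 0; i < read; i++) {
        state = dfa.step(state, buffer[i]);
        if (state == -1) {          // dead state: the prefix can no longer match
          state = 0;                // restart; a real tokenizer also tracks offsets
          continue;
        }
        if (dfa.isAccept(state)) {
          // a real tokenizer would emit or extend a token ending here
        }
      }
    }
  }
}
{code}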


