[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation

Dawid Weiss (JIRA) Tue, 27 Sep 2016 00:34:38 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15525368#comment-15525368
 ]


Dawid Weiss commented on LUCENE-7465:
-------------------------------------

These regexps are generated from the data, so not so easy :) And the data (and 
the regexps) can contain Unicode characters as well. I'll go back to this, time 
permitting. I'm not saying the patch is wrong, just that j.u.Pattern was pretty 
darn fast, even for large-scale patterns (and inputs). I was in particular 
surprised at re2 (C implementation) performance being way lower than Java's. Of 
course there were no adversarial cases in the input.

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
>                 Key: LUCENE-7465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7465
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.3
>
>         Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that 
> uses Lucene's RegExp impl instead of the JDK's:
>   * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp 
> is attempted the user discovers it up front instead of later on when a 
> "lucky" document arrives
>   * It processes the incoming characters as a stream, only pulling 128 
> characters at a time, vs the existing {{PatternTokenizer}} which currently 
> reads the entire string up front (this has caused heap problems in the past)
>   * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and 
> improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps 
> don't yet implement sub group capture.  I think we could add that at some 
> point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we 
> did that we should maybe name it differently 
> ({{SimplePatternSplitTokenizer}}?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation

Reply via email to