[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876188#comment-15876188 ]

Michael McCandless commented on LUCENE-7465:
--------------------------------------------

That test failure was actually a real bug in both {{SimplePatternTokenizer}} and {{SimpleSplitPatternTokenizer}}! Yay for {{TestRandomChains}} ;) I pushed a fix.

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
>                 Key: LUCENE-7465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7465
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.5
>
>         Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that uses Lucene's RegExp impl instead of the JDK's:
> * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp is attempted the user discovers it up front instead of later on when a "lucky" document arrives
> * It processes the incoming characters as a stream, only pulling 128 characters at a time, vs the existing {{PatternTokenizer}} which currently reads the entire string up front (this has caused heap problems in the past)
> * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps don't yet implement sub group capture. I think we could add that at some point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we did that we should maybe name it differently ({{SimplePatternSplitTokenizer}}?).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
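The two benefits the issue description claims (the pattern is compiled to a DFA up front, and input is pulled as a stream rather than read whole) can be sketched outside of Lucene. The following is a hypothetical Python illustration, not Lucene's actual code: a hand-built DFA for the pattern {{[a-z]+}} drives longest-match tokenization over a reader that is refilled only {{chunk_size}} characters at a time.

```python
import io

# Hand-built DFA for the pattern [a-z]+ : state 0 is the start state,
# state 1 is accepting; any non-letter transition is a dead end (None).
def step(state, ch):
    return 1 if 'a' <= ch <= 'z' else None

ACCEPTING = {1}

def tokenize(reader, chunk_size=128):
    """Longest-match tokenization over a character stream, pulling at most
    chunk_size characters at a time instead of reading everything up front."""
    buf, pos, tokens = '', 0, []
    while True:
        if pos >= len(buf):                 # refill the buffer lazily
            chunk = reader.read(chunk_size)
            if not chunk:
                break
            buf, pos = buf[pos:] + chunk, 0
        state, i, last_accept = 0, pos, -1
        while True:
            if i >= len(buf):               # lookahead may need more input
                more = reader.read(chunk_size)
                if not more:
                    break
                buf += more
            state = step(state, buf[i])
            if state is None:               # DFA died: stop extending
                break
            i += 1
            if state in ACCEPTING:
                last_accept = i             # longest match seen so far
        if last_accept > pos:
            tokens.append(buf[pos:last_accept])
            pos = last_accept
        else:
            pos += 1                        # no match here; skip one char
    return tokens

print(tokenize(io.StringIO("foo BAR baz")))  # ['foo', 'baz']
```

Because the DFA exists before any document is seen, a pattern that is too expensive to determinize would fail at construction time rather than on an unlucky input, which is the first benefit the description lists.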
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876185#comment-15876185 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit c3028b32207b8837cdaf29918edd4e0cdc9621ad in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c3028b3 ]

LUCENE-7465: fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876181#comment-15876181 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit 2d03aa21a2b674d36e201f6309e646f37771b73b in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2d03aa2 ]

LUCENE-7465: fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875772#comment-15875772 ]

Michael McCandless commented on LUCENE-7465:
--------------------------------------------

Thanks [~steve_rowe]; I'll have a look.
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875092#comment-15875092 ]

Steve Rowe commented on LUCENE-7465:
------------------------------------

Another reproducing TestRandomChains master seed, from [https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/19011/]:

{noformat}
   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='Sy \ud98b\udc04\uff52\u0384\u942fP\u040a\u0004\u0455 |uh)a)mrB- '
   [junit4]   2> Exception from random analyzer:
   [junit4]   2> charfilters=
   [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@29890127, [])
   [junit4]   2> tokenizer=
   [junit4]   2>   org.apache.lucene.analysis.pattern.SimplePatternTokenizer(org.apache.lucene.util.automaton.Automaton@9bd88db)
   [junit4]   2> filters=
   [junit4]   2> offsetsAreCorrect=true
   [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=821B9B2715E2264F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=el-GR -Dtests.timezone=America/Lima -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 0.15s J2 | TestRandomChains.testRandomChainsWithLargeStrings <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: finalOffset expected:<24> but was:<23>
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([821B9B2715E2264F:E84024364CAC06BC]:0)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
   [junit4]    >        at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:880)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=true): {}, locale=el-GR, timezone=America/Lima
   [junit4]   2> NOTE: Linux 4.4.0-53-generic amd64/Oracle Corporation 1.8.0_121 (64-bit)/cpus=12,threads=1,free=457314736,total=508952576
{noformat}
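The {{finalOffset expected:<24> but was:<23>}} failure above, and the lookahead-at-end-of-input fix that followed, revolve around an invariant that {{assertTokenStreamContents}} enforces: once the token stream is exhausted, finalOffset must equal the total number of characters read, including trailing characters that belong to no token. A hypothetical Python sketch of that invariant (this is illustrative, not Lucene's code):

```python
def tokenize_with_offsets(text):
    """Toy [a-z]+ tokenizer returning (term, start, end) tuples plus the
    finalOffset the stream would report once exhausted."""
    tokens, i = [], 0
    while i < len(text):
        if 'a' <= text[i] <= 'z':
            start = i
            # lookahead: extend the match; it must terminate cleanly when
            # it runs off the end of the input (the corner case fixed above)
            while i < len(text) and 'a' <= text[i] <= 'z':
                i += 1
            tokens.append((text[start:i], start, i))
        else:
            i += 1
    # finalOffset counts every character consumed, not just the last
    # token's endOffset; reporting the latter produces exactly the kind
    # of off-by-one the random test caught
    return tokens, len(text)

tokens, final_offset = tokenize_with_offsets("ab-x ")
print(tokens, final_offset)  # [('ab', 0, 2), ('x', 3, 4)] 5
```

A buggy implementation that reported the last token's endOffset (4) as the final offset here would fail the same assertion, since one trailing space was consumed after the last token.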
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867599#comment-15867599 ]

Michael McCandless commented on LUCENE-7465:
--------------------------------------------

OK I pushed a fix ... sneaky wrong random instance usage.
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867598#comment-15867598 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit 74f208d716ef25aaead97e674c5296f2be54eb76 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=74f208d ]

LUCENE-7465: use the right random instance, otherwise we hit creepy test failures
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867597#comment-15867597 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit 5ca3ca205234694224341331980216a07c8e518b in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ca3ca2 ]

LUCENE-7465: use the right random instance, otherwise we hit creepy test failures
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866854#comment-15866854 ]

Michael McCandless commented on LUCENE-7465:
--------------------------------------------

Thanks [~steve_rowe], I'll dig.
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866514#comment-15866514 ]

Steve Rowe commented on LUCENE-7465:
------------------------------------

My Jenkins found a reproducing seed on master for a TestRandomChains failure that implicates the new tokenizer:

{noformat}
   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 gmabf'
   [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 gmabf'
   [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 gmabf'
   [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-17,5,TGRP-TestRandomChains]
   [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]   2>        at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]   2>        at org.junit.Assert.fail(Assert.java:93)
   [junit4]   2>        at org.junit.Assert.failNotEquals(Assert.java:647)
   [junit4]   2>        at org.junit.Assert.assertEquals(Assert.java:128)
   [junit4]   2>        at org.junit.Assert.assertEquals(Assert.java:472)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
   [junit4]   2>
   [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-18,5,TGRP-TestRandomChains]
   [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]   2>        at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]   2>        at org.junit.Assert.fail(Assert.java:93)
   [junit4]   2>        at org.junit.Assert.failNotEquals(Assert.java:647)
   [junit4]   2>        at org.junit.Assert.assertEquals(Assert.java:128)
   [junit4]   2>        at org.junit.Assert.assertEquals(Assert.java:472)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
   [junit4]   2>
   [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-19,5,TGRP-TestRandomChains]
   [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]   2>        at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]   2>        at org.junit.Assert.fail(Assert.java:93)
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864235#comment-15864235 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit c24e03e6bf4d09e6f31eee8192bb6c0c4b2b6d27 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c24e03e ]

LUCENE-7465: add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864078#comment-15864078 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit 93fa72f77bd024aa09eef043c65c64a6524613dc in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=93fa72f ]

LUCENE-7465: add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842872#comment-15842872 ]

David Smiley commented on LUCENE-7465:
--------------------------------------

bq. (Adrien) I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox.

I think the factory isn't going to stand in the way of either tokenizer evolving. A problem with separate factories is that {{PatternTokenizerFactory}} is already an excellent name, and the name carries no hint of which implementation is behind it. In general I don't like polluting the namespace with different implementations of effectively the same thing; the first impl to show up grabs the best name. The factory provides an excellent opportunity to bridge these multiple implementations. Yet alas, my arguments aren't swaying anyone, so go ahead.
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842694#comment-15842694 ] Dawid Weiss commented on LUCENE-7465:

bq. I think this is interesting, but let's explore it on a future issue?

Absolutely!
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842611#comment-15842611 ] Michael McCandless commented on LUCENE-7465:

Whoa, this issue almost dropped past the event horizon on my TODO list! I'll revive the patch and push soon ...

bq. I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized Automaton

I think this is interesting, but let's explore it on a future issue?
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547870#comment-15547870 ] Dawid Weiss commented on LUCENE-7465:

bq. Maybe we should explore an re2j version too.

I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized {{Automaton}}... Sure, it wouldn't be able to protect against an explosion of states at compile-time, but it'd still be possible to protect against it at runtime (if too many states need to be tracked within the automaton, we could throw a matching exception). Note that a non-deterministic automaton for the above regular expression is actually pretty simple!
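The runtime-protection idea above can be sketched in plain Java. This is a toy illustration, not Lucene's {{Automaton}} API: the NFA encoding, class name, and state cap are all invented for the sketch, and epsilon transitions are ignored.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy sketch of matching directly on a non-determinized automaton by
 *  tracking the set of live NFA states, bailing out at runtime if that set
 *  grows past a cap (instead of determinizing up front). The representation
 *  here is invented for illustration only. */
public class NfaMatcher {
  /** transitions[s] holds triples {min, max, dest}: from state s, any char
   *  in [min, max] moves to state dest. State 0 is the start state. */
  private final int[][][] transitions;
  private final Set<Integer> acceptStates;
  private final int maxLiveStates;

  public NfaMatcher(int[][][] transitions, Set<Integer> acceptStates, int maxLiveStates) {
    this.transitions = transitions;
    this.acceptStates = acceptStates;
    this.maxLiveStates = maxLiveStates;
  }

  public boolean matches(String input) {
    Set<Integer> live = new HashSet<>();
    live.add(0);
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      Set<Integer> next = new HashSet<>();
      for (int state : live) {
        for (int[] t : transitions[state]) {
          if (c >= t[0] && c <= t[1]) {
            next.add(t[2]);
          }
        }
      }
      if (next.size() > maxLiveStates) {
        // runtime protection in place of compile-time determinization
        throw new IllegalStateException("too many live states: " + next.size());
      }
      live = next;
    }
    for (int state : live) {
      if (acceptStates.contains(state)) return true;
    }
    return false;
  }
}
```

For example, the NFA for {{(a|b)*a}} needs just two states here, while its determinized form would still be small; the interesting cases are the ones below where determinization explodes but the live-state set stays tiny.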
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546776#comment-15546776 ] Michael McCandless commented on LUCENE-7465:

Thank you for the example [~dweiss]. Indeed that's a hard regexp to determinize. It's interesting because the determinization requires many states, yet it minimizes to an apparently modest number of states (though many transitions). E.g. at 30 clauses, the determinized form produced 7652 states and 136898 transitions, but after minimization that drops to 150 states and 2960 transitions. I tried to run {{dot}} on this FSA but it struggles :)

Net/net the DFA approach is not usable in some cases (like this one); such users must use the JDK implementation. Maybe we should explore an {{re2j}} version too.

bq. Btw. if you're looking into this again, piggyback a change to Operations.determinize and replace LinkedList with an ArrayDeque, it certainly won't hurt.

Excellent, I'll fold that in!
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544822#comment-15544822 ] Dawid Weiss commented on LUCENE-7465:

On a happier note, if it's just a union of fixed strings (an FSA, effectively) you're matching against, then it's much, much faster with Lucene (and Brics), of course (times in ms):

{code}
JavaRegExpMatcher samples: 10 time: 11026
JavaRegExpMatcher samples: 10 time: 11046
JavaRegExpMatcher samples: 10 time: 11036
LuceneRegExpMatcher samples: 10 time: 19
LuceneRegExpMatcher samples: 10 time: 19
LuceneRegExpMatcher samples: 10 time: 18
{code}
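Those numbers make sense: a union of fixed strings describes a finite language, so matching degenerates to set membership, which a DFA (or a plain hash set) answers in one pass, while {{java.util.regex}} tries the alternatives one by one. A small correctness sketch of that equivalence (names invented; this is not the private benchmark from the comment above):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** A regex alternation of quoted literals and a hash-set lookup decide
 *  exactly the same language; the set lookup just does it without trying
 *  each branch in turn. Illustration only, not a benchmark. */
public class UnionMatch {
  public static boolean viaRegex(String[] words, String candidate) {
    String union = Arrays.stream(words)
        .map(Pattern::quote)            // escape any metacharacters in the literals
        .collect(Collectors.joining("|"));
    return Pattern.compile(union).matcher(candidate).matches();
  }

  public static boolean viaSet(String[] words, String candidate) {
    Set<String> set = new HashSet<>(Arrays.asList(words));
    return set.contains(candidate);
  }
}
```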
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544750#comment-15544750 ] Dawid Weiss commented on LUCENE-7465:

Hi Mike. Sorry it took me so long. So, check out this example snippet:

{code}
public static void main(String[] args) {
  String[] clauses = "(.*mervi)|(.*hectic)|(petrographic)|(terracing.*)|(3\\.65.*)|(.*mea.*)|(.*n0)|(researchbas)|(chamfer.*)|(.*danaher)|(.*immediacy)|(.*selec)|(.*transi)|(.*photoreaction)|(ceo2)|(asif)|(.*koo.*)|(lasso)|(allis)|(.*paleodata.*)|(needs.*)|(auser)|(micropterus.*)|(.*sdw)|(.*blp.*)|(cent)|(hybridoma)|(tai.*)|(ransac)|(.*gfptag)|(.*falt.*)|(tubular)|(.*closet.*)|(.*halted.*)|(plish.*)|(.*aauw)|(satisf)|(.*kolodn)|(.*glycidyl.*)|(phytodetritu.*)|(.*2r)|(.*remodeler)|(astronomi)|(.*maienschein)|(universityof)|(event\\(s)|(exacerbation)|(leidi.*)|(stemmer.*)|(.*arrow)|(.*domestic)|(.*maq.*)|(pluggable.*)|(scheiner.*)|(interpenetrate)|(.*diving)|(superscript.*)|(.*cherry.*)|(saddlepoint)|(pyrolit.*)|(prosser)|(nyberg)|(iceberg.*)|(.*hammer.*)|(india.*)|(fsa)|(.*x\\(u.*)|(klima)|(good.*)|(.*provid)|(.*streaked)|(.*oppenheimer.*)|(loyalty.*)|(.*caspi.*)|(.*,99)|(.*unaccompanied)|(subharmon)|(.*hillis.*)|(ferment)|(olli)|(.*storybook)|(1358)|(.*savi.*)|(contagion)|(.*freeness)|(.*500m)|(brudvig.*)|(.*genemark)|(.*jahren.*)|(aguirr)|(12345.*)|(.*prolic)|(seafood.*)|(.*remedy)|(.*mildred.*)|(.*bering.*)|(monolithically.*)|(disequilibrium)".split("\\|");
  for (int i = 1; i < clauses.length; i++) {
    String re = Arrays.stream(clauses)
        .limit(i)
        .collect(Collectors.joining("|"));
    RegExp regExp = new RegExp(re);
    Automaton automaton = regExp.toAutomaton(1);
    System.out.println("Clauses: " + i + ", states=" + automaton.getNumStates());
  }
}
{code}

As you can see, it's essentially a "prefix/suffix/exact" match. Unfortunately this is a very bad example to determinize, so I can't even sensibly benchmark it against other implementations (there can be hundreds or thousands of such clauses). But even this short snippet shows the severe penalty full determinization incurs -- try to run it and you'll see.

Btw. if you're looking into this again, piggyback a change to {{Operations.determinize}} and replace {{LinkedList}} with an {{ArrayDeque}}; it certainly won't hurt.
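The determinization blowup described above is the textbook one. As a self-contained illustration (this is not Lucene's {{Operations.determinize}}; the hand-built NFA below encodes the classic {{(a|b)*a(a|b)^(n-1)}} family, whose minimal DFA is known to need 2^n states), a bitmask subset construction makes the exponential growth easy to count, and it uses the {{ArrayDeque}} suggested in the comment above:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

/** Subset construction for the NFA of (a|b)*a(a|b)^(n-1): states 0..n,
 *  state 0 loops on a and b, reads 'a' to state 1, and states 1..n-1 step
 *  forward on any character; state n accepts. A subset of NFA states fits
 *  in one long bitmask, so this sketch requires n <= 62. The count of
 *  reachable DFA states comes out to exactly 2^n. */
public class Blowup {
  public static int countDfaStates(int n) {
    Set<Long> seen = new HashSet<>();
    ArrayDeque<Long> queue = new ArrayDeque<>(); // ArrayDeque rather than LinkedList
    long start = 1L;                             // the subset {0}
    seen.add(start);
    queue.add(start);
    while (!queue.isEmpty()) {
      long subset = queue.poll();
      for (char c : new char[] {'a', 'b'}) {
        long next = 0;
        for (int s = 0; s <= n; s++) {
          if ((subset & (1L << s)) == 0) continue;
          if (s == 0) {
            next |= 1L;                 // 0 -a,b-> 0: the (a|b)* loop
            if (c == 'a') next |= 2L;   // 0 -a-> 1
          } else if (s < n) {
            next |= 1L << (s + 1);      // i -a,b-> i+1
          }
        }
        if (seen.add(next)) queue.add(next);
      }
    }
    return seen.size();
  }
}
```

Each reachable subset records which of the last n input characters were 'a', which is why every one of the 2^n combinations shows up; the alternation-of-globs regexp in the snippet above blows up for the same underlying reason.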
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543441#comment-15543441 ] Adrien Grand commented on LUCENE-7465:

I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox.
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525368#comment-15525368 ] Dawid Weiss commented on LUCENE-7465:

These regexps are generated from the data, so not so easy :) And the data (and the regexps) can contain Unicode characters as well. I'll go back to this, time permitting. I'm not saying the patch is wrong, just that j.u.Pattern was pretty darn fast, even for large-scale patterns (and inputs). I was in particular surprised at re2 (C implementation) performance being way lower than Java's. Of course there were no adversarial cases in the input.
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523315#comment-15523315 ] Michael McCandless commented on LUCENE-7465:

bq. default to current java regexp impl;

But I think that would mean this new impl would very rarely be used. I think it's better to give it a separate factory so it has more visibility? If it really does work better for users over time, word will spread, new blog posts/docs written, etc.
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523032#comment-15523032 ] David Smiley commented on LUCENE-7465:

bq. I agree this would be nice, but my worry about taking that approach is which one we default to? Maybe if we make it a required param? But then how to implement back compat?

Not a required param; default to current java regexp impl; no back-compat worry. It's the most flexible and so I think makes the best default.

bq. I think such auto-detection (looking at the user's pattern and picking the engine) is dangerous.

Ok.
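The back-compat argument in this proposal can be made concrete with a tiny sketch: one factory name, an optional "method" hint selecting the engine, and an absent hint falling back to the JDK engine so existing configurations behave unchanged. All names here are hypothetical; this is an illustration of the proposed dispatch, not code from the patch.

```java
import java.util.Map;

/** Hypothetical single-factory dispatch: pick the regexp engine from an
 *  optional "method" init argument, defaulting to the pre-existing JDK
 *  engine so old configs keep their exact behavior. */
public class PatternTokenizerChooser {
  public static String chooseEngine(Map<String, String> args) {
    // no hint given: stay on the JDK engine, preserving back compat
    return args.getOrDefault("method", "jdk");
  }
}
```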
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523014#comment-15523014 ] Michael McCandless commented on LUCENE-7465:

Maybe you could share just the regexp :) But, if you do repeat the test w/ Lucene, then try to use this patch if possible (just the {{XXXRunAutomaton}} changes), because it optimizes for code points < 256. Or, if your data is all non-ascii, then don't bother using this patch :)
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523000#comment-15523000 ] Dawid Weiss commented on LUCENE-7465:

I'll try to repeat the experiment with Lucene's regexp when I have a spare moment. The benchmarks (or rather: data) cannot be shared, unfortunately, but it involved regexps with hundreds of alternatives and globs. Definitely not something that people can edit by hand.
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522992#comment-15522992 ] Michael McCandless commented on LUCENE-7465:

bq. Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. method="lucene" or method="simple".

I agree this would be nice, but my worry about taking that approach is: which one do we default to? Maybe if we make it a required param? But then how do we implement back compat?

bq. Then I wonder if we might detect circumstances in which this new implementation is preferable?

I think such auto-detection (looking at the user's pattern and picking the engine) is dangerous. Maybe a user is debugging a tricky regexp, and adding one new character could cause us to pick a different engine. I think for now it should be a conscious choice?

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---
>
> Key: LUCENE-7465
> URL: https://issues.apache.org/jira/browse/LUCENE-7465
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0), 6.3
>
> Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that
> uses Lucene's RegExp impl instead of the JDK's:
> * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp
> is attempted the user discovers it up front instead of later on when a
> "lucky" document arrives
> * It processes the incoming characters as a stream, only pulling 128
> characters at a time, vs the existing {{PatternTokenizer}} which currently
> reads the entire string up front (this has caused heap problems in the past)
> * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and
> improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps
> don't yet implement sub group capture. I think we could add that at some
> point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we
> did that we should maybe name it differently
> ({{SimplePatternSplitTokenizer}}?).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
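The streaming behavior described in the issue (pulling only 128 characters at a time and stepping a precompiled DFA, instead of reading the whole input up front like {{PatternTokenizer}}) can be sketched in plain Java. This is a simplified illustration, not the actual {{SimplePatternTokenizer}}: the "pattern" here is a hand-built DFA for [a-z]+ and the class name is made up.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative only (not real Lucene code): tokenize by stepping a DFA over
// characters pulled from the Reader in small 128-char chunks, emitting a
// token whenever a maximal match ends.
public class StreamingDfaTokenizer {
  private static final int BUFFER_SIZE = 128;

  // Hand-built DFA for [a-z]+: returns the next state, or -1 to reject.
  private static int step(int state, char c) {
    return (c >= 'a' && c <= 'z') ? 1 : -1;
  }

  public static List<String> tokenize(Reader reader) throws IOException {
    List<String> tokens = new ArrayList<>();
    char[] buffer = new char[BUFFER_SIZE];
    StringBuilder current = new StringBuilder();
    int state = 0;
    int read;
    while ((read = reader.read(buffer)) != -1) {
      for (int i = 0; i < read; i++) {
        char c = buffer[i];
        int next = step(state, c);
        if (next == -1) {
          if (current.length() > 0) {  // end of a maximal match: emit token
            tokens.add(current.toString());
            current.setLength(0);
          }
          state = 0;
        } else {
          current.append(c);
          state = next;
        }
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());  // trailing match at end of input
    }
    return tokens;
  }

  public static void main(String[] args) throws IOException {
    // prints [foo, baz, qux]
    System.out.println(tokenize(new StringReader("foo BAR baz42qux")));
  }
}
```

Because the DFA needs only the current state and the next character, heap usage stays bounded by the chunk size no matter how large the input is, which is the point of the streaming approach.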
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522972#comment-15522972 ] Michael McCandless commented on LUCENE-7465:

[~dawid.weiss] is this a benchmark I could try to run? My regexp was admittedly trivial so it would be nice to have a beefier real-world regexp to play with ;) The bench is also trivial (I pushed it to luceneutil).

When you tested dk.brics, did you call {{RunAutomaton.setAlphabet}}? That should be a biggish speedup, especially if your regexp has many unique character start/end ranges. In Lucene's fork of dk.brics we automatically do that in the UTF-8 case, and with this patch, also for the first 256 Unicode characters in the full character case.
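The alphabet-classing idea behind {{setAlphabet}} can be sketched as follows. This is an illustration of the technique only; the class and method names are invented, not the dk.brics or Lucene API. The point is that precomputing a per-character class table for the first 256 code points turns the per-character interval lookup into a single array load.

```java
import java.util.Arrays;

// Illustrative sketch of alphabet classing: map each of the first 256 code
// points to the index of the transition interval it falls into, so the hot
// path is a table lookup instead of a binary search over interval starts.
public class AlphabetClassMap {
  private final int[] points;              // sorted interval start points
  private final byte[] classmap = new byte[256];

  public AlphabetClassMap(int[] sortedPoints) {
    this.points = sortedPoints;
    int cls = 0;
    for (int c = 0; c < 256; c++) {
      if (cls + 1 < points.length && c == points[cls + 1]) {
        cls++;                             // crossed into the next interval
      }
      classmap[c] = (byte) cls;
    }
  }

  public int charClass(int c) {
    if (c < 256) {
      return classmap[c];                  // fast path: one array load
    }
    // Slow path for higher code points: binary search the interval starts.
    int i = Arrays.binarySearch(points, c);
    return i >= 0 ? i : -i - 2;
  }

  public static void main(String[] args) {
    // Intervals: [0..'a'), ['a'..'z'], ('z'..max] -> start points 0, 'a', 'z'+1.
    AlphabetClassMap m = new AlphabetClassMap(new int[] {0, 'a', 'z' + 1});
    System.out.println(m.charClass('b'));  // prints 1
    System.out.println(m.charClass('A'));  // prints 0
  }
}
```

The more unique start/end ranges a regexp has, the deeper the binary search it replaces, which is why the comment above expects a bigger win for regexps with many character ranges.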
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522947#comment-15522947 ] David Smiley commented on LUCENE-7465:

Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. {{method="lucene"}} or {{method="simple"}}. Then I wonder if we might detect circumstances in which this new implementation is preferable?

The motivation for one factory is similar to the WhitespaceTokenizer "rule" param. People know they have a regexp and want to tokenize on it... they could easily overlook a different factory when the existing one is already well named and appears in blogs/docs/examples.
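The single-factory idea could look something like the following hypothetical sketch. None of these names are real Lucene/Solr API; the default value also encodes the back-compat question raised in the thread (existing configs that never set {{method}} keep the JDK engine).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not real Lucene/Solr API): one factory dispatching on
// a "method" attribute from the analyzer configuration.
public class PatternMethodDispatch {
  public static String chooseEngine(Map<String, String> args) {
    // Defaulting to "jdk" preserves back compat for configs written before
    // the attribute existed.
    String method = args.getOrDefault("method", "jdk");
    switch (method) {
      case "jdk":
        return "java.util.regex.Pattern";
      case "lucene":
        return "org.apache.lucene.util.automaton.RegExp";
      default:
        throw new IllegalArgumentException("unknown method: " + method);
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("method", "lucene");
    System.out.println(chooseEngine(conf));
  }
}
```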
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522355#comment-15522355 ] Dawid Weiss commented on LUCENE-7465:

Interesting that it's faster than PatternTokenizer! I haven't looked at the patch, Mike, but I did some experiments recently with regexp benchmarking (for our internal needs) using fairly large regular expression patterns (over even larger inputs). The native Java Pattern implementation always won by a large (and I mean: super large) margin over everything else I tried: brics, re2 (Java port), re2 (native implementation), and Apache ORO (out of curiosity only; it didn't pass correctness tests for me).

Brics wasn't too bad, but the gain from early detection of "too hard" DFA expressions was overshadowed by DFA expansion (very large automata in our case), so unless you don't have control over the patterns (in which case adversarial inputs become possible), it didn't make sense for me to switch.

Also, the fact that the Java implementation was fast was quite surprising to me: we had a large number of alternatives in our regular expressions and I thought these would yield nicely to automaton optimizations (pull-up of prefix matching, etc.). In the end, it didn't seem to matter. So perhaps the performance is a factor of how complex the regular expressions are (and how they're benchmarked)? Don't know.
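A minimal harness in the spirit of this discussion might look like the sketch below. All names here are illustrative (the real bench lives in luceneutil), and the pattern is deliberately trivial: it compares a JDK {{Pattern}} find-loop against a hand-rolled DFA for [a-z]+, whereas the large real-world regexps discussed above would exercise the engines very differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative micro-benchmark sketch (not the luceneutil bench): count
// matches of [a-z]+ with the JDK engine and with a hand-rolled DFA, and
// time both. Results for a trivial pattern like this say little about the
// large-alternation regexps discussed in the thread.
public class RegexBenchSketch {
  static int countJdk(String input) {
    Matcher m = Pattern.compile("[a-z]+").matcher(input);
    int n = 0;
    while (m.find()) n++;
    return n;
  }

  static int countDfa(String input) {
    int n = 0;
    boolean inToken = false;
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      boolean match = c >= 'a' && c <= 'z';  // one-state DFA for [a-z]+
      if (match && !inToken) n++;            // new token starts here
      inToken = match;
    }
    return n;
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100_000; i++) sb.append("foo BAR baz ");
    String input = sb.toString();

    long t0 = System.nanoTime();
    int a = countJdk(input);
    long t1 = System.nanoTime();
    int b = countDfa(input);
    long t2 = System.nanoTime();

    System.out.println("jdk=" + a + " (" + (t1 - t0) / 1_000_000 + " ms), "
        + "dfa=" + b + " (" + (t2 - t1) / 1_000_000 + " ms)");
  }
}
```

Both counters must agree before any timing comparison is meaningful, which is also why the correctness-test failures Dawid mentions (for Apache ORO) disqualified that engine outright.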