[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876188#comment-15876188 ]

Michael McCandless commented on LUCENE-7465:
--------------------------------------------

That test failure was actually a real bug in both {{SimplePatternTokenizer}} and {{SimpleSplitPatternTokenizer}}! Yay for {{TestRandomChains}} ;) I pushed a fix.

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
>                 Key: LUCENE-7465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7465
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.5
>
>         Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that uses Lucene's RegExp impl instead of the JDK's:
> * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp is attempted the user discovers it up front instead of later on when a "lucky" document arrives
> * It processes the incoming characters as a stream, only pulling 128 characters at a time, vs the existing {{PatternTokenizer}} which currently reads the entire string up front (this has caused heap problems in the past)
> * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps don't yet implement sub group capture. I think we could add that at some point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we did that we should maybe name it differently ({{SimplePatternSplitTokenizer}}?).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
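The two benefits the issue description claims (the pattern is compiled to a DFA up front, and input is pulled as a stream rather than read whole) can be sketched outside of Lucene. The following is a hypothetical Python illustration, not Lucene's actual code: a hand-built DFA for the pattern {{[a-z]+}} drives longest-match tokenization over a reader that is refilled only {{chunk_size}} characters at a time.

```python
import io

# Hand-built DFA for the pattern [a-z]+ : state 0 is the start state,
# state 1 is accepting; any non-letter transition is a dead end (None).
def step(state, ch):
    return 1 if 'a' <= ch <= 'z' else None

ACCEPTING = {1}

def tokenize(reader, chunk_size=128):
    """Longest-match tokenization over a character stream, pulling at most
    chunk_size characters at a time instead of reading everything up front."""
    buf, pos, tokens = '', 0, []
    while True:
        if pos >= len(buf):                 # refill the buffer lazily
            chunk = reader.read(chunk_size)
            if not chunk:
                break
            buf, pos = buf[pos:] + chunk, 0
        state, i, last_accept = 0, pos, -1
        while True:
            if i >= len(buf):               # lookahead may need more input
                more = reader.read(chunk_size)
                if not more:
                    break
                buf += more
            state = step(state, buf[i])
            if state is None:               # DFA died: stop extending
                break
            i += 1
            if state in ACCEPTING:
                last_accept = i             # longest match seen so far
        if last_accept > pos:
            tokens.append(buf[pos:last_accept])
            pos = last_accept
        else:
            pos += 1                        # no match here; skip one char
    return tokens

print(tokenize(io.StringIO("foo BAR baz")))  # ['foo', 'baz']
```

Because the DFA exists before any document is seen, a pattern that is too expensive to determinize would fail at construction time rather than on an unlucky input, which is the first benefit the description lists.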
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876185#comment-15876185 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit c3028b32207b8837cdaf29918edd4e0cdc9621ad in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c3028b3 ]

LUCENE-7465: fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876181#comment-15876181 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit 2d03aa21a2b674d36e201f6309e646f37771b73b in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2d03aa2 ]

LUCENE-7465: fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875772#comment-15875772 ]

Michael McCandless commented on LUCENE-7465:
--------------------------------------------

Thanks [~steve_rowe]; I'll have a look.
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875092#comment-15875092 ]

Steve Rowe commented on LUCENE-7465:
------------------------------------

Another reproducing TestRandomChains master seed, from [https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/19011/]:

{noformat}
   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='Sy \ud98b\udc04\uff52\u0384\u942fP\u040a\u0004\u0455 |uh)a)mrB- '
   [junit4]   2> Exception from random analyzer:
   [junit4]   2> charfilters=
   [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@29890127, [])
   [junit4]   2> tokenizer=
   [junit4]   2>   org.apache.lucene.analysis.pattern.SimplePatternTokenizer(org.apache.lucene.util.automaton.Automaton@9bd88db)
   [junit4]   2> filters=
   [junit4]   2> offsetsAreCorrect=true
   [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=821B9B2715E2264F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=el-GR -Dtests.timezone=America/Lima -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 0.15s J2 | TestRandomChains.testRandomChainsWithLargeStrings <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: finalOffset expected:<24> but was:<23>
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([821B9B2715E2264F:E84024364CAC06BC]:0)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
   [junit4]    >        at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:880)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=true): {}, locale=el-GR, timezone=America/Lima
   [junit4]   2> NOTE: Linux 4.4.0-53-generic amd64/Oracle Corporation 1.8.0_121 (64-bit)/cpus=12,threads=1,free=457314736,total=508952576
{noformat}
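The {{finalOffset expected:<24> but was:<23>}} failure above, and the lookahead-at-end-of-input fix that followed, revolve around an invariant that {{assertTokenStreamContents}} enforces: once the token stream is exhausted, finalOffset must equal the total number of characters read, including trailing characters that belong to no token. A hypothetical Python sketch of that invariant (this is illustrative, not Lucene's code):

```python
def tokenize_with_offsets(text):
    """Toy [a-z]+ tokenizer returning (term, start, end) tuples plus the
    finalOffset the stream would report once exhausted."""
    tokens, i = [], 0
    while i < len(text):
        if 'a' <= text[i] <= 'z':
            start = i
            # lookahead: extend the match; it must terminate cleanly when
            # it runs off the end of the input (the corner case fixed above)
            while i < len(text) and 'a' <= text[i] <= 'z':
                i += 1
            tokens.append((text[start:i], start, i))
        else:
            i += 1
    # finalOffset counts every character consumed, not just the last
    # token's endOffset; reporting the latter produces exactly the kind
    # of off-by-one the random test caught
    return tokens, len(text)

tokens, final_offset = tokenize_with_offsets("ab-x ")
print(tokens, final_offset)  # [('ab', 0, 2), ('x', 3, 4)] 5
```

A buggy implementation that reported the last token's endOffset (4) as the final offset here would fail the same assertion, since one trailing space was consumed after the last token.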
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867599#comment-15867599 ]

Michael McCandless commented on LUCENE-7465:
--------------------------------------------

OK I pushed a fix ... sneaky wrong random instance usage.
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867598#comment-15867598 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit 74f208d716ef25aaead97e674c5296f2be54eb76 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=74f208d ]

LUCENE-7465: use the right random instance, otherwise we hit creepy test failures
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867597#comment-15867597 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit 5ca3ca205234694224341331980216a07c8e518b in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ca3ca2 ]

LUCENE-7465: use the right random instance, otherwise we hit creepy test failures
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866854#comment-15866854 ]

Michael McCandless commented on LUCENE-7465:
--------------------------------------------

Thanks [~steve_rowe], I'll dig.
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866514#comment-15866514 ]

Steve Rowe commented on LUCENE-7465:
------------------------------------

My Jenkins found a reproducing seed on master for a TestRandomChains failure that implicates the new tokenizer:

{noformat}
   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 gmabf'
   [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 gmabf'
   [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 gmabf'
   [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-17,5,TGRP-TestRandomChains]
   [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]   2>        at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]   2>        at org.junit.Assert.fail(Assert.java:93)
   [junit4]   2>        at org.junit.Assert.failNotEquals(Assert.java:647)
   [junit4]   2>        at org.junit.Assert.assertEquals(Assert.java:128)
   [junit4]   2>        at org.junit.Assert.assertEquals(Assert.java:472)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
   [junit4]   2>
   [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-18,5,TGRP-TestRandomChains]
   [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]   2>        at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]   2>        at org.junit.Assert.fail(Assert.java:93)
   [junit4]   2>        at org.junit.Assert.failNotEquals(Assert.java:647)
   [junit4]   2>        at org.junit.Assert.assertEquals(Assert.java:128)
   [junit4]   2>        at org.junit.Assert.assertEquals(Assert.java:472)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
   [junit4]   2>        at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
   [junit4]   2>
   [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-19,5,TGRP-TestRandomChains]
   [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
   [junit4]   2>        at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
   [junit4]   2>        at org.junit.Assert.fail(Assert.java:93)
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864235#comment-15864235 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit c24e03e6bf4d09e6f31eee8192bb6c0c4b2b6d27 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c24e03e ]

LUCENE-7465: add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864078#comment-15864078 ]

ASF subversion and git services commented on LUCENE-7465:
---------------------------------------------------------

Commit 93fa72f77bd024aa09eef043c65c64a6524613dc in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=93fa72f ]

LUCENE-7465: add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842872#comment-15842872 ]

David Smiley commented on LUCENE-7465:
--------------------------------------

bq. (Adrien) I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox.

I think the factory isn't going to stand in the way of either tokenizer evolving. A problem with separate factories is that {{PatternTokenizerFactory}} is already an excellent name, and the name carries no hint of which implementation is behind it. In general I don't like polluting the namespace with different implementations of effectively the same thing; the first impl to show up grabs the best name. The factory provides an excellent opportunity to bridge these multiple implementations. Yet alas, my arguments aren't swaying anyone, so go ahead.
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842694#comment-15842694 ] Dawid Weiss commented on LUCENE-7465:

bq. I think this is interesting, but let's explore it on a future issue?

Absolutely!
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842611#comment-15842611 ] Michael McCandless commented on LUCENE-7465:

Whoa, this issue almost dropped past the event horizon on my TODO list! I'll revive the patch and push soon ...

bq. I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized Automaton

I think this is interesting, but let's explore it on a future issue?
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547870#comment-15547870 ] Dawid Weiss commented on LUCENE-7465:

bq. Maybe we should explore an re2j version too.

I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized {{Automaton}}... Sure, it wouldn't be able to protect against an explosion of states at compile-time, but it'd still be possible to protect against it at runtime (if too many states need to be tracked within the automaton, we could throw a matching exception). Note that a non-deterministic automaton for the above regular expression is actually pretty simple!
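The runtime-protection idea above can be sketched in plain Java. This is a toy illustration, not Lucene's {{Automaton}} API: the NFA encoding, class name, and state cap are all invented for the sketch, and epsilon transitions are ignored.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy sketch of matching directly on a non-determinized automaton by
 *  tracking the set of live NFA states, bailing out at runtime if that set
 *  grows past a cap (instead of determinizing up front). The representation
 *  here is invented for illustration only. */
public class NfaMatcher {
  /** transitions[s] holds triples {min, max, dest}: from state s, any char
   *  in [min, max] moves to state dest. State 0 is the start state. */
  private final int[][][] transitions;
  private final Set<Integer> acceptStates;
  private final int maxLiveStates;

  public NfaMatcher(int[][][] transitions, Set<Integer> acceptStates, int maxLiveStates) {
    this.transitions = transitions;
    this.acceptStates = acceptStates;
    this.maxLiveStates = maxLiveStates;
  }

  public boolean matches(String input) {
    Set<Integer> live = new HashSet<>();
    live.add(0);
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      Set<Integer> next = new HashSet<>();
      for (int state : live) {
        for (int[] t : transitions[state]) {
          if (c >= t[0] && c <= t[1]) {
            next.add(t[2]);
          }
        }
      }
      if (next.size() > maxLiveStates) {
        // runtime protection in place of compile-time determinization
        throw new IllegalStateException("too many live states: " + next.size());
      }
      live = next;
    }
    for (int state : live) {
      if (acceptStates.contains(state)) return true;
    }
    return false;
  }
}
```

For example, the NFA for {{(a|b)*a}} needs just two states here, while its determinized form would still be small; the interesting cases are the ones below where determinization explodes but the live-state set stays tiny.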
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546776#comment-15546776 ] Michael McCandless commented on LUCENE-7465:

Thank you for the example [~dweiss]. Indeed that's a hard regexp to determinize. It's interesting because the determinization requires many states, yet it minimizes to an apparently modest number of states (though many transitions). E.g. at 30 clauses, the determinized form produced 7652 states and 136898 transitions, but after minimization that drops to 150 states and 2960 transitions. I tried to run {{dot}} on this FSA but it struggles :)

Net/net the DFA approach is not usable in some cases (like this one); such users must use the JDK implementation. Maybe we should explore an {{re2j}} version too.

bq. Btw. if you're looking into this again, piggyback a change to Operations.determinize and replace LinkedList with an ArrayDeque, it certainly won't hurt.

Excellent, I'll fold that in!
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544822#comment-15544822 ] Dawid Weiss commented on LUCENE-7465:

On a happier note, if it's just a union of fixed strings (an FSA, effectively) you're matching against, then it's much, much faster with Lucene (and Brics), of course (times in ms):

{code}
JavaRegExpMatcher samples: 10 time: 11026
JavaRegExpMatcher samples: 10 time: 11046
JavaRegExpMatcher samples: 10 time: 11036
LuceneRegExpMatcher samples: 10 time: 19
LuceneRegExpMatcher samples: 10 time: 19
LuceneRegExpMatcher samples: 10 time: 18
{code}
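Those numbers make sense: a union of fixed strings describes a finite language, so matching degenerates to set membership, which a DFA (or a plain hash set) answers in one pass, while {{java.util.regex}} tries the alternatives one by one. A small correctness sketch of that equivalence (names invented; this is not the private benchmark from the comment above):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** A regex alternation of quoted literals and a hash-set lookup decide
 *  exactly the same language; the set lookup just does it without trying
 *  each branch in turn. Illustration only, not a benchmark. */
public class UnionMatch {
  public static boolean viaRegex(String[] words, String candidate) {
    String union = Arrays.stream(words)
        .map(Pattern::quote)            // escape any metacharacters in the literals
        .collect(Collectors.joining("|"));
    return Pattern.compile(union).matcher(candidate).matches();
  }

  public static boolean viaSet(String[] words, String candidate) {
    Set<String> set = new HashSet<>(Arrays.asList(words));
    return set.contains(candidate);
  }
}
```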
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544750#comment-15544750 ] Dawid Weiss commented on LUCENE-7465:

Hi Mike. Sorry it took me so long. So, check out this example snippet:

{code}
public static void main(String[] args) {
  String[] clauses = "(.*mervi)|(.*hectic)|(petrographic)|(terracing.*)|(3\\.65.*)|(.*mea.*)|(.*n0)|(researchbas)|(chamfer.*)|(.*danaher)|(.*immediacy)|(.*selec)|(.*transi)|(.*photoreaction)|(ceo2)|(asif)|(.*koo.*)|(lasso)|(allis)|(.*paleodata.*)|(needs.*)|(auser)|(micropterus.*)|(.*sdw)|(.*blp.*)|(cent)|(hybridoma)|(tai.*)|(ransac)|(.*gfptag)|(.*falt.*)|(tubular)|(.*closet.*)|(.*halted.*)|(plish.*)|(.*aauw)|(satisf)|(.*kolodn)|(.*glycidyl.*)|(phytodetritu.*)|(.*2r)|(.*remodeler)|(astronomi)|(.*maienschein)|(universityof)|(event\\(s)|(exacerbation)|(leidi.*)|(stemmer.*)|(.*arrow)|(.*domestic)|(.*maq.*)|(pluggable.*)|(scheiner.*)|(interpenetrate)|(.*diving)|(superscript.*)|(.*cherry.*)|(saddlepoint)|(pyrolit.*)|(prosser)|(nyberg)|(iceberg.*)|(.*hammer.*)|(india.*)|(fsa)|(.*x\\(u.*)|(klima)|(good.*)|(.*provid)|(.*streaked)|(.*oppenheimer.*)|(loyalty.*)|(.*caspi.*)|(.*,99)|(.*unaccompanied)|(subharmon)|(.*hillis.*)|(ferment)|(olli)|(.*storybook)|(1358)|(.*savi.*)|(contagion)|(.*freeness)|(.*500m)|(brudvig.*)|(.*genemark)|(.*jahren.*)|(aguirr)|(12345.*)|(.*prolic)|(seafood.*)|(.*remedy)|(.*mildred.*)|(.*bering.*)|(monolithically.*)|(disequilibrium)".split("\\|");
  for (int i = 1; i < clauses.length; i++) {
    String re = Arrays.stream(clauses)
        .limit(i)
        .collect(Collectors.joining("|"));
    RegExp regExp = new RegExp(re);
    Automaton automaton = regExp.toAutomaton(1);
    System.out.println("Clauses: " + i + ", states=" + automaton.getNumStates());
  }
}
{code}

As you can see, it's essentially a "prefix/suffix/exact" match. Unfortunately this is a very bad example to determinize, so I can't even sensibly benchmark it against other implementations (there can be hundreds or thousands of such clauses). But even this short snippet shows the severe penalty full determinization incurs -- try to run it and you'll see.

Btw. if you're looking into this again, piggyback a change to {{Operations.determinize}} and replace {{LinkedList}} with an {{ArrayDeque}}; it certainly won't hurt.
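The determinization blowup described above is the textbook one. As a self-contained illustration (this is not Lucene's {{Operations.determinize}}; the hand-built NFA below encodes the classic {{(a|b)*a(a|b)^(n-1)}} family, whose minimal DFA is known to need 2^n states), a bitmask subset construction makes the exponential growth easy to count, and it uses the {{ArrayDeque}} suggested in the comment above:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

/** Subset construction for the NFA of (a|b)*a(a|b)^(n-1): states 0..n,
 *  state 0 loops on a and b, reads 'a' to state 1, and states 1..n-1 step
 *  forward on any character; state n accepts. A subset of NFA states fits
 *  in one long bitmask, so this sketch requires n <= 62. The count of
 *  reachable DFA states comes out to exactly 2^n. */
public class Blowup {
  public static int countDfaStates(int n) {
    Set<Long> seen = new HashSet<>();
    ArrayDeque<Long> queue = new ArrayDeque<>(); // ArrayDeque rather than LinkedList
    long start = 1L;                             // the subset {0}
    seen.add(start);
    queue.add(start);
    while (!queue.isEmpty()) {
      long subset = queue.poll();
      for (char c : new char[] {'a', 'b'}) {
        long next = 0;
        for (int s = 0; s <= n; s++) {
          if ((subset & (1L << s)) == 0) continue;
          if (s == 0) {
            next |= 1L;                 // 0 -a,b-> 0: the (a|b)* loop
            if (c == 'a') next |= 2L;   // 0 -a-> 1
          } else if (s < n) {
            next |= 1L << (s + 1);      // i -a,b-> i+1
          }
        }
        if (seen.add(next)) queue.add(next);
      }
    }
    return seen.size();
  }
}
```

Each reachable subset records which of the last n input characters were 'a', which is why every one of the 2^n combinations shows up; the alternation-of-globs regexp in the snippet above blows up for the same underlying reason.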
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543441#comment-15543441 ] Adrien Grand commented on LUCENE-7465:

I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox.
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525368#comment-15525368 ] Dawid Weiss commented on LUCENE-7465:

These regexps are generated from the data, so not so easy :) And the data (and the regexps) can contain Unicode characters as well. I'll go back to this, time permitting. I'm not saying the patch is wrong, just that j.u.Pattern was pretty darn fast, even for large-scale patterns (and inputs). I was in particular surprised at re2 (C implementation) performance being way lower than Java's. Of course there were no adversarial cases in the input.
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523315#comment-15523315 ] Michael McCandless commented on LUCENE-7465:

bq. default to current java regexp impl;

But I think that would mean this new impl would very rarely be used. I think it's better to give it a separate factory so it has more visibility? If it really does work better for users over time, word will spread, new blog posts/docs written, etc.
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523032#comment-15523032 ] David Smiley commented on LUCENE-7465:

bq. I agree this would be nice, but my worry about taking that approach is which one we default to? Maybe if we make it a required param? But then how to implement back compat?

Not a required param; default to current java regexp impl; no back-compat worry. It's the most flexible and so I think makes the best default.

bq. I think such auto-detection (looking at the user's pattern and picking the engine) is dangerous.

Ok.
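The back-compat argument in this proposal can be made concrete with a tiny sketch: one factory name, an optional "method" hint selecting the engine, and an absent hint falling back to the JDK engine so existing configurations behave unchanged. All names here are hypothetical; this is an illustration of the proposed dispatch, not code from the patch.

```java
import java.util.Map;

/** Hypothetical single-factory dispatch: pick the regexp engine from an
 *  optional "method" init argument, defaulting to the pre-existing JDK
 *  engine so old configs keep their exact behavior. */
public class PatternTokenizerChooser {
  public static String chooseEngine(Map<String, String> args) {
    // no hint given: stay on the JDK engine, preserving back compat
    return args.getOrDefault("method", "jdk");
  }
}
```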
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523014#comment-15523014 ] Michael McCandless commented on LUCENE-7465:

Maybe you could share just the regexp :) But, if you do repeat the test w/ Lucene, then try to use this patch if possible (just the {{XXXRunAutomaton}} changes), because it optimizes for code points < 256. Or, if your data is all non-ascii, then don't bother using this patch :)
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523000#comment-15523000 ] Dawid Weiss commented on LUCENE-7465:

I'll try to repeat the experiment with Lucene's regexp when I have a spare moment. The benchmarks (or rather: data) cannot be shared, unfortunately, but it involved regexps with hundreds of alternatives and globs. Definitely not something that people can edit by hand.
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522992#comment-15522992 ] Michael McCandless commented on LUCENE-7465:

bq. Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. method="lucene" or method="simple".

I agree this would be nice, but my worry about taking that approach is: which one do we default to? Maybe if we make it a required param? But then how do we implement back compat?

bq. Then I wonder if we might detect circumstances in which this new implementation is preferable?

I think such auto-detection (looking at the user's pattern and picking the engine) is dangerous. Maybe a user is debugging a tricky regexp, and adding one new character could cause us to pick a different engine. I think for now it should be a conscious choice?

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---
>
> Key: LUCENE-7465
> URL: https://issues.apache.org/jira/browse/LUCENE-7465
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: master (7.0), 6.3
>
> Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that
> uses Lucene's RegExp impl instead of the JDK's:
> * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp
> is attempted the user discovers it up front instead of later on when a
> "lucky" document arrives
> * It processes the incoming characters as a stream, only pulling 128
> characters at a time, vs the existing {{PatternTokenizer}} which currently
> reads the entire string up front (this has caused heap problems in the past)
> * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and
> improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps
> don't yet implement sub group capture. I think we could add that at some
> point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we
> did that we should maybe name it differently
> ({{SimplePatternSplitTokenizer}}?).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
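The streaming behavior described in the issue (pulling only 128 characters at a time and stepping a precompiled DFA, instead of reading the whole input up front like {{PatternTokenizer}}) can be sketched in plain Java. This is a simplified illustration, not the actual {{SimplePatternTokenizer}}: the "pattern" here is a hand-built DFA for [a-z]+ and the class name is made up.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative only (not real Lucene code): tokenize by stepping a DFA over
// characters pulled from the Reader in small 128-char chunks, emitting a
// token whenever a maximal match ends.
public class StreamingDfaTokenizer {
  private static final int BUFFER_SIZE = 128;

  // Hand-built DFA for [a-z]+: returns the next state, or -1 to reject.
  private static int step(int state, char c) {
    return (c >= 'a' && c <= 'z') ? 1 : -1;
  }

  public static List<String> tokenize(Reader reader) throws IOException {
    List<String> tokens = new ArrayList<>();
    char[] buffer = new char[BUFFER_SIZE];
    StringBuilder current = new StringBuilder();
    int state = 0;
    int read;
    while ((read = reader.read(buffer)) != -1) {
      for (int i = 0; i < read; i++) {
        char c = buffer[i];
        int next = step(state, c);
        if (next == -1) {
          if (current.length() > 0) {  // end of a maximal match: emit token
            tokens.add(current.toString());
            current.setLength(0);
          }
          state = 0;
        } else {
          current.append(c);
          state = next;
        }
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());  // trailing match at end of input
    }
    return tokens;
  }

  public static void main(String[] args) throws IOException {
    // prints [foo, baz, qux]
    System.out.println(tokenize(new StringReader("foo BAR baz42qux")));
  }
}
```

Because the DFA needs only the current state and the next character, heap usage stays bounded by the chunk size no matter how large the input is, which is the point of the streaming approach.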
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522972#comment-15522972 ] Michael McCandless commented on LUCENE-7465:

[~dawid.weiss] is this a benchmark I could try to run? My regexp was admittedly trivial so it would be nice to have a beefier real-world regexp to play with ;) The bench is also trivial (I pushed it to luceneutil).

When you tested dk.brics, did you call {{RunAutomaton.setAlphabet}}? That should be a biggish speedup, especially if your regexp has many unique character start/end ranges. In Lucene's fork of dk.brics we automatically do that in the UTF-8 case, and with this patch, also for the first 256 Unicode characters in the full character case.
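The alphabet-classing idea behind {{setAlphabet}} can be sketched as follows. This is an illustration of the technique only; the class and method names are invented, not the dk.brics or Lucene API. The point is that precomputing a per-character class table for the first 256 code points turns the per-character interval lookup into a single array load.

```java
import java.util.Arrays;

// Illustrative sketch of alphabet classing: map each of the first 256 code
// points to the index of the transition interval it falls into, so the hot
// path is a table lookup instead of a binary search over interval starts.
public class AlphabetClassMap {
  private final int[] points;              // sorted interval start points
  private final byte[] classmap = new byte[256];

  public AlphabetClassMap(int[] sortedPoints) {
    this.points = sortedPoints;
    int cls = 0;
    for (int c = 0; c < 256; c++) {
      if (cls + 1 < points.length && c == points[cls + 1]) {
        cls++;                             // crossed into the next interval
      }
      classmap[c] = (byte) cls;
    }
  }

  public int charClass(int c) {
    if (c < 256) {
      return classmap[c];                  // fast path: one array load
    }
    // Slow path for higher code points: binary search the interval starts.
    int i = Arrays.binarySearch(points, c);
    return i >= 0 ? i : -i - 2;
  }

  public static void main(String[] args) {
    // Intervals: [0..'a'), ['a'..'z'], ('z'..max] -> start points 0, 'a', 'z'+1.
    AlphabetClassMap m = new AlphabetClassMap(new int[] {0, 'a', 'z' + 1});
    System.out.println(m.charClass('b'));  // prints 1
    System.out.println(m.charClass('A'));  // prints 0
  }
}
```

The more unique start/end ranges a regexp has, the deeper the binary search it replaces, which is why the comment above expects a bigger win for regexps with many character ranges.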
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522947#comment-15522947 ] David Smiley commented on LUCENE-7465:

Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. {{method="lucene"}} or {{method="simple"}}. Then I wonder if we might detect circumstances in which this new implementation is preferable?

The motivation for one factory is similar to the WhitespaceTokenizer "rule" param. People know they have a regexp and want to tokenize on it... they could easily overlook a different factory when the existing one is already well named and appears in blogs/docs/examples.
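The single-factory idea could look something like the following hypothetical sketch. None of these names are real Lucene/Solr API; the default value also encodes the back-compat question raised in the thread (existing configs that never set {{method}} keep the JDK engine).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not real Lucene/Solr API): one factory dispatching on
// a "method" attribute from the analyzer configuration.
public class PatternMethodDispatch {
  public static String chooseEngine(Map<String, String> args) {
    // Defaulting to "jdk" preserves back compat for configs written before
    // the attribute existed.
    String method = args.getOrDefault("method", "jdk");
    switch (method) {
      case "jdk":
        return "java.util.regex.Pattern";
      case "lucene":
        return "org.apache.lucene.util.automaton.RegExp";
      default:
        throw new IllegalArgumentException("unknown method: " + method);
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("method", "lucene");
    System.out.println(chooseEngine(conf));
  }
}
```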
[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation
[ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15522355#comment-15522355 ] Dawid Weiss commented on LUCENE-7465:

Interesting that it's faster than PatternTokenizer! I haven't looked at the patch, Mike, but I did some experiments recently with regexp benchmarking (for our internal needs) using fairly large regular expression patterns (over even larger inputs). The native Java Pattern implementation always won by a large (and I mean: super large) margin over everything else I tried: brics, re2 (Java port), re2 (native implementation), and Apache ORO (out of curiosity only; it didn't pass correctness tests for me).

Brics wasn't too bad, but the gain from early detection of "too hard" DFA expressions was overshadowed by DFA expansion (very large automata in our case), so unless you don't have control over the patterns (in which case adversarial inputs become possible), it didn't make sense for me to switch.

Also, the fact that the Java implementation was fast was quite surprising to me: we had a large number of alternatives in our regular expressions and I thought these would yield nicely to automaton optimizations (pull-up of prefix matching, etc.). In the end, it didn't seem to matter. So perhaps the performance is a factor of how complex the regular expressions are (and how they're benchmarked)? Don't know.
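A minimal harness in the spirit of this discussion might look like the sketch below. All names here are illustrative (the real bench lives in luceneutil), and the pattern is deliberately trivial: it compares a JDK {{Pattern}} find-loop against a hand-rolled DFA for [a-z]+, whereas the large real-world regexps discussed above would exercise the engines very differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative micro-benchmark sketch (not the luceneutil bench): count
// matches of [a-z]+ with the JDK engine and with a hand-rolled DFA, and
// time both. Results for a trivial pattern like this say little about the
// large-alternation regexps discussed in the thread.
public class RegexBenchSketch {
  static int countJdk(String input) {
    Matcher m = Pattern.compile("[a-z]+").matcher(input);
    int n = 0;
    while (m.find()) n++;
    return n;
  }

  static int countDfa(String input) {
    int n = 0;
    boolean inToken = false;
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      boolean match = c >= 'a' && c <= 'z';  // one-state DFA for [a-z]+
      if (match && !inToken) n++;            // new token starts here
      inToken = match;
    }
    return n;
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100_000; i++) sb.append("foo BAR baz ");
    String input = sb.toString();

    long t0 = System.nanoTime();
    int a = countJdk(input);
    long t1 = System.nanoTime();
    int b = countDfa(input);
    long t2 = System.nanoTime();

    System.out.println("jdk=" + a + " (" + (t1 - t0) / 1_000_000 + " ms), "
        + "dfa=" + b + " (" + (t2 - t1) / 1_000_000 + " ms)");
  }
}
```

Both counters must agree before any timing comparison is meaningful, which is also why the correctness-test failures Dawid mentions (for Apache ORO) disqualified that engine outright.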