[ https://issues.apache.org/jira/browse/LUCENE-4078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284080#comment-13284080 ]
Hoss Man edited comment on LUCENE-4078 at 5/26/12 10:27 PM:
------------------------------------------------------------

bq. When you think of it '|' is not an operator ...

I'm not really following you there ... '|' is the OR operator, so the regex "|" is a redundant way of saying "", which is "the empty pattern", or a way of saying "match the empty string". Consider in particular when the regex is used for a replace or split type operation: "z|" means "match z or the empty string"...

{code}
$ perl -MData::Dumper -le 'print Dumper split /z|/, "ABzCD";'
$VAR1 = 'A';
$VAR2 = 'B';
$VAR3 = 'C';
$VAR4 = 'D';
{code}

You should see similar results in Java if you use Matcher.find() as an iterator.

As for the original pattern: "]|" -- that's just a convenience form of (impossible to write this in JIRA markup without code tags)...

{code}\]|{code}

... an unescaped close bracket (that has no matching open bracket) is treated as a literal...

{code}
$ perl -MData::Dumper -le 'print Dumper split /]|/, "AB]CD";'
$VAR1 = 'A';
$VAR2 = 'B';
$VAR3 = 'C';
$VAR4 = 'D';
{code}

What I can't explain is why Java treats the "empty string" as something that matches in the middle of a code point. That certainly sounds like a bug, unless there is some subtlety in Unicode TR#18 that I'm not seeing...

http://www.unicode.org/reports/tr18/

> PatternReplaceCharFilter assertion error
> ----------------------------------------
>
>                 Key: LUCENE-4078
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4078
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 4.0
>
>
> Build: https://builds.apache.org/job/Lucene-trunk/1942/
> 1 tests failed.
> REGRESSION:  org.apache.lucene.analysis.pattern.TestPatternReplaceCharFilter.testRandomStrings
> Error Message:
> Stack Trace:
> java.lang.AssertionError
>       at __randomizedtesting.SeedInfo.seed([8E91A6AC395FEED9:618A6129A5BB9EC]:0)
>       at org.apache.lucene.analysis.MockTokenizer.readCodePoint(MockTokenizer.java:153)
>       at org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:123)
>       at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:558)
>       at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:488)
>       at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:430)
>       at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:424)
>       at org.apache.lucene.analysis.pattern.TestPatternReplaceCharFilter.testRandomStrings(TestPatternReplaceCharFilter.java:323)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:616)
>       at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1969)
>       at com.carrotsearch.randomizedtesting.RandomizedRunner.access$1100(RandomizedRunner.java:132)
>       at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:814)
>       at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:875)
>       at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:889)
>       at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
>       at org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:32)
>       at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
>       at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
>       at org.apache.lucene.util.TestRuleReportUncaughtExceptions$1.evaluate(TestRuleReportUncaughtExceptions.java:68)
>       at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
>       at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
>       at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(Randomized
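
For anyone who wants to try the Java side of the comment above, here is a minimal sketch that iterates Matcher.find() over the same two inputs as the Perl one-liners and then lists which match offsets, if any, fall between the two halves of a surrogate pair. The class and helper names (EmptyAlternationDemo, findAll, insideSurrogatePair) and the sample string with a supplementary character are made up for illustration; none of this is Lucene code, and whether an empty match is actually reported inside a surrogate pair may depend on the JDK in use, so no output is claimed for that part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyAlternationDemo {

    /** Collects "start-end" offsets for every match reported by Matcher.find(). */
    public static List<String> findAll(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    /** True if index sits between a high surrogate and the low surrogate that follows it. */
    public static boolean insideSurrogatePair(String s, int index) {
        return index > 0
            && index < s.length()
            && Character.isHighSurrogate(s.charAt(index - 1))
            && Character.isLowSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        // "z|" means "z or the empty string": an empty match at each
        // position where "z" does not match, plus the "z" itself at 2-3.
        System.out.println(findAll("z|", "ABzCD"));   // [0-0, 1-1, 2-3, 3-3, 4-4, 5-5]

        // An unescaped "]" with no opening bracket is a literal in Java too.
        System.out.println(findAll("]|", "AB]CD"));

        // Hypothetical input: one supplementary code point (two chars) between A and B.
        String s = "A\uD835\uDD38B";
        Matcher m = Pattern.compile("]|").matcher(s);
        while (m.find()) {
            if (insideSurrogatePair(s, m.start())) {
                System.out.println("empty match inside a surrogate pair at offset " + m.start());
            }
        }
    }
}
```

The spans are reported as char offsets rather than code point offsets on purpose, since the question raised in the comment is about offsets that land in the middle of a code point.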