[ 
https://issues.apache.org/jira/browse/LUCENE-4078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284080#comment-13284080
 ] 

Hoss Man edited comment on LUCENE-4078 at 5/26/12 10:27 PM:
------------------------------------------------------------

bq. When you think of it '|' is not an operator ...

I'm not really following you there ... '|' is the OR operator, so the regex "|" 
is an alternation of two empty patterns: a redundant way of writing "" (the 
empty pattern), which means "match the empty string".  

Consider in particular when the regex is used for a replace or split type 
operation: "z|" means "match z or the empty string"...

{code}
$ perl -MData::Dumper -le 'print Dumper split /z|/, "ABzCD";'
$VAR1 = 'A';
$VAR2 = 'B';
$VAR3 = 'C';
$VAR4 = 'D';
{code}

You should see similar results in Java if you use Matcher.find() as an 
iterator.
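
A minimal Java sketch of that Matcher.find() loop, for illustration only (the 
class and method names are made up here, they are not part of Lucene). It 
records the start-end offsets of every match of "z|" against "ABzCD":

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: lists every match find() reports.
public class EmptyAlternationDemo {

    // Returns "start-end" for each successive match of regex in input.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // "z|" means "match z or the empty string".
        System.out.println(spans("z|", "ABzCD"));
    }
}
```

The only non-empty span is 2-3 (the "z"); every other match is zero-length. 
Note that find() also reports a zero-length match at offset 3, immediately 
after the "z", so the raw match list is similar to, but not identical to, the 
field boundaries perl's split uses.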

As for the original pattern: "]|" -- that's just a convenience form of 
(impossible to write this in jira markup w/o code tags)...
{code}\]|{code}
 .. an unescaped close bracket (that has no matching open bracket) is treated 
as a literal...

{code}
$ perl -MData::Dumper -le 'print Dumper split /]|/, "AB]CD";'
$VAR1 = 'A';
$VAR2 = 'B';
$VAR3 = 'C';
$VAR4 = 'D';
{code}
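
The same equivalence can be checked in Java. The sketch below (class and 
method names are illustrative assumptions) compiles both the unescaped and the 
escaped form and compares the match offsets each one produces on "AB]CD":

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Lucene.
public class CloseBracketDemo {

    // Concatenates "start-end" for every match so two patterns can be compared.
    static String trace(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.start()).append('-').append(m.end()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String input = "AB]CD";
        String unescaped = trace("]|", input);   // bare close bracket
        String escaped = trace("\\]|", input);   // explicitly escaped
        System.out.println(unescaped);
        System.out.println(escaped);
        System.out.println("same matches: " + unescaped.equals(escaped));
    }
}
```

If the two traces are equal, java.util.regex is treating the bare "]" as a 
literal, just like perl does.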

What I can't explain is why Java treats "empty string" as something that 
matches in the middle of a code point.  That certainly sounds like a bug, unless 
there is some subtlety in Unicode TR#18 that I'm not seeing...

http://www.unicode.org/reports/tr18/
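
A small sketch for reproducing the "middle of a code point" behaviour outside 
of Lucene. The class name, method name and the choice of U+1F600 as test input 
are assumptions made for illustration. It reports whether any match of the 
pattern starts between the high and low surrogate of a supplementary 
character:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Lucene.
public class SurrogateSplitDemo {

    // Returns true if any match starts between a high and a low surrogate,
    // i.e. inside a single supplementary code point.
    static boolean matchesInsideCodePoint(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            int i = m.start();
            if (i > 0 && i < input.length()
                    && Character.isHighSurrogate(input.charAt(i - 1))
                    && Character.isLowSurrogate(input.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String input = "\uD83D\uDE00"; // U+1F600: one code point, two chars
        System.out.println(matchesInsideCodePoint("]|", input));
    }
}
```

Matcher.find() advances by one char after a zero-length match, so on a 
standard JDK this is expected to report a match at char offset 1, between the 
two surrogates. For input with no supplementary characters it always returns 
false.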
                
> PatternReplaceCharFilter assertion error
> ----------------------------------------
>
>                 Key: LUCENE-4078
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4078
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 4.0
>
>
> Build: https://builds.apache.org/job/Lucene-trunk/1942/
> 1 tests failed.
> REGRESSION:  org.apache.lucene.analysis.pattern.TestPatternReplaceCharFilter.testRandomStrings
> Error Message:
> Stack Trace:
> java.lang.AssertionError
>        at __randomizedtesting.SeedInfo.seed([8E91A6AC395FEED9:618A6129A5BB9EC]:0)
>        at org.apache.lucene.analysis.MockTokenizer.readCodePoint(MockTokenizer.java:153)
>        at org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:123)
>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:558)
>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:488)
>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:430)
>        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:424)
>        at org.apache.lucene.analysis.pattern.TestPatternReplaceCharFilter.testRandomStrings(TestPatternReplaceCharFilter.java:323)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>        at java.lang.reflect.Method.invoke(Method.java:616)
>        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1969)
>        at com.carrotsearch.randomizedtesting.RandomizedRunner.access$1100(RandomizedRunner.java:132)
>        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:814)
>        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:875)
>        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:889)
>        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
>        at org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:32)
>        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
>        at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
>        at org.apache.lucene.util.TestRuleReportUncaughtExceptions$1.evaluate(TestRuleReportUncaughtExceptions.java:68)
>        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
>        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
>        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(Randomized
