[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619814#comment-14619814
 ] 

Cao Manh Dat commented on LUCENE-6595:
--------------------------------------

Thanks [~mikemccand]!
{quote}
@@ -215,7 +230,8 @@
     };
     
     int numRounds = RANDOM_MULTIPLIER * 10000;
-    checkRandomData(random(), analyzer, numRounds);
+//    checkRandomData(random(), analyzer, numRounds);
+    checkAnalysisConsistency(random(),analyzer,true,"m?(y '&");
     analyzer.close();
   }
{quote}
My fault; I played around with the test and forgot to roll it back.

{quote}
It's spooky the test fails, because with the right default here (hmm, maybe it 
should be {code} off + cumulativeDiff {code} since it's an input offset) it 
should behave exactly as before?
{quote}
Nice idea. I changed it to {code} off - cumulativeDiff {code} and it works 
perfectly.

{quote}
For the default impl for CharFilter.correctEnd should we just use 
CharFilter.correct?
Can we rename correctOffset --> correctStartOffset now that we also have a 
correctEndOffset?
{quote}
Nice refactoring.
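For reference, a minimal sketch of the renamed API being discussed (the names 
follow the discussion above, not necessarily the committed patch): 
correctEndOffset defaults to the same mapping as correctStartOffset, so 
existing CharFilters keep their current behavior unless they override it.

```java
// Hypothetical sketch, not actual Lucene code: the default end-offset
// correction simply delegates to the start-offset correction.
abstract class CharFilterSketch {
    abstract int correctStartOffset(int currentOff);

    // Default: end offsets are corrected exactly like start offsets,
    // preserving today's behavior for subclasses that don't override this.
    int correctEndOffset(int currentOff) {
        return correctStartOffset(currentOff);
    }
}

class IdentityFilterSketch extends CharFilterSketch {
    @Override
    int correctStartOffset(int currentOff) {
        return currentOff; // no mapping: offsets pass through unchanged
    }
}

public class CorrectEndDefaultDemo {
    public static void main(String[] args) {
        CharFilterSketch f = new IdentityFilterSketch();
        System.out.println(f.correctStartOffset(3)); // 3
        System.out.println(f.correctEndOffset(3));   // 3, same as the start path
    }
}
```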

{quote}
Does (correctOffset(endOffset-1)+1) not work? It would be nice not to add the 
new method to CharFilter (only to Tokenizer).
{quote}
I tried to do that, but it can't work, because the information for the special 
case lives in BaseCharFilter.
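To illustrate why, here is a sketch (not Lucene code) that hard-codes the two 
output-to-input offset maps from the issue description below; it shows that 
{{1 + correct(endOffset - 1)}} fixes the paren-erasing case but breaks the 
cccc -> cc case:

```java
// Sketch only: the offset maps are hard-coded from the issue description,
// standing in for what MappingCharFilter/BaseCharFilter encode internally.
public class OffsetCorrectionDemo {
    // "(F31)" with both parens erased: output "F31", output->input offsets.
    static final int[] PAREN_MAP = {1, 2, 3, 5};
    // "cccc" mapped to "cc": output "cc", output->input offsets.
    static final int[] CCCC_MAP = {0, 1, 4};

    static int correct(int[] map, int off) {
        return map[off];
    }

    public static void main(String[] args) {
        // Token "F31" has output endOffset = 3.
        System.out.println(correct(PAREN_MAP, 3));     // 5: today's (wrong) end
        System.out.println(1 + correct(PAREN_MAP, 2)); // 4: off-by-one fix works

        // Token "cc" has output endOffset = 2.
        System.out.println(correct(CCCC_MAP, 2));      // 4: correct today
        System.out.println(1 + correct(CCCC_MAP, 1));  // 2: off-by-one fix breaks it
    }
}
```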

[~rcmuir] I will try to explain the solution in a slide; I'm not very good at 
explaining it in words :( 


> CharFilter offsets correction is wonky
> --------------------------------------
>
>                 Key: LUCENE-6595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6595
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>         Attachments: LUCENE-6595.patch, LUCENE-6595.patch, LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue: 
> https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
>   ( -> 
>   ) -> 
> {noformat}
> i.e., just erase left and right paren, then tokenizing the string
> "(F31)" with e.g. WhitespaceTokenizer, produces a single token F31,
> with start offset 1 (good).
> But for its end offset, I would expect/want 4, but it produces 5
> today.
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
>   Output offset: 0 1 2 3
>    Input offset: 1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1).  It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which creates startOffset=1 (good) and
> endOffset=5 (bad).
> At first, to fix this, I thought this is an "off-by-1" and when
> correcting the endOffset we really should return
> 1+correct(outputEndOffset-1), which would return the correct value (4)
> here.
> But that's too naive, e.g. here's another example:
> {noformat}
>   cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4)
> but if we do this "off-by-1" fix for endOffset, we would get the wrong
> endOffset (2).
> I'm not sure what to do here...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
