[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619814#comment-14619814 ]
Cao Manh Dat commented on LUCENE-6595:
--------------------------------------

Thanks [~mikemccand]!

{quote}
@@ -215,7 +230,8 @@
     };
     int numRounds = RANDOM_MULTIPLIER * 10000;
-    checkRandomData(random(), analyzer, numRounds);
+//    checkRandomData(random(), analyzer, numRounds);
+    checkAnalysisConsistency(random(),analyzer,true,"m?(y '&");
     analyzer.close();
   }
{quote}

My fault, I played around with the test and forgot to roll it back.

{quote}
It's spooky the test fails because with the right default here (hmm maybe it should be {code}off + cumulativeDiff{code} since it's an input offset), it should behave exactly as before?
{quote}

Nice idea. I changed it to {code}off - cumulativeDiff{code} and it works perfectly.

{quote}
For the default impl for CharFilter.correctEnd should we just use CharFilter.correct? Can we rename correctOffset --> correctStartOffset now that we also have a correctEndOffset?
{quote}

Nice refactoring.

{quote}
Does (correctOffset(endOffset-1)+1) not work? It would be nice not to add the new method to CharFilter (only to Tokenizer).
{quote}

I tried that, but it can't work: the information needed for the special case lives in BaseCharFilter.

[~rcmuir] I will try to explain the solution in a slide; I'm not very good at explaining it :(

> CharFilter offsets correction is wonky
> --------------------------------------
>
>                 Key: LUCENE-6595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6595
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>         Attachments: LUCENE-6595.patch, LUCENE-6595.patch, LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue:
> https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
> ( ->
> ) ->
> {noformat}
> i.e., just erase the left and right paren, then tokenizing the string
> "(F31)" with e.g. WhitespaceTokenizer produces a single token F31,
> with start offset 1 (good).
> But for its end offset, I would expect/want 4, yet it produces 5
> today.
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
> Output offset: 0 1 2 3
>  Input offset: 1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1). It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above array, which produces startOffset=1 (good) and
> endOffset=5 (bad).
> At first, to fix this, I thought this is an "off-by-1": when
> correcting the endOffset we really should return
> 1+correct(outputEndOffset-1), which would return the correct value (4)
> here.
> But that's too naive, e.g. here's another example:
> {noformat}
> cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4),
> but if we applied this "off-by-1" fix for endOffset, we would get the
> wrong endOffset (2).
> I'm not sure what to do here...
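For anyone who wants to see the wonkiness directly, here is a minimal, self-contained sketch reproducing both cases from the description above. It assumes the Lucene 5.x analysis APIs (NormalizeCharMap, MappingCharFilter, WhitespaceTokenizer); the class name WonkyOffsetsDemo is made up for illustration:

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class WonkyOffsetsDemo {

  // Run the given char map over the input, tokenize on whitespace,
  // and print each token with its corrected offsets.
  static void tokenize(NormalizeCharMap map, String input) throws Exception {
    MappingCharFilter charFilter = new MappingCharFilter(map, new StringReader(input));
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(charFilter);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsets = tokenizer.addAttribute(OffsetAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println("token=" + term + " start=" + offsets.startOffset()
          + " end=" + offsets.endOffset());
    }
    tokenizer.end();
    tokenizer.close();
  }

  public static void main(String[] args) throws Exception {
    // Case 1: erase both parens. Per the issue, "(F31)" yields token F31
    // with start=1 (good) but end=5 today, where end=4 is expected.
    NormalizeCharMap.Builder eraseParens = new NormalizeCharMap.Builder();
    eraseParens.add("(", "");
    eraseParens.add(")", "");
    tokenize(eraseParens.build(), "(F31)");

    // Case 2: shrink "cccc" to "cc". Per the issue, today this yields the
    // correct offsets (0, 4); the naive 1+correct(end-1) fix would wrongly
    // give end=2 here.
    NormalizeCharMap.Builder shrink = new NormalizeCharMap.Builder();
    shrink.add("cccc", "cc");
    tokenize(shrink.build(), "cccc");
  }
}
{code}

The first call demonstrates the bad end offset (5 instead of 4); the second shows the case that the simple off-by-1 fix would break.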