[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597883#comment-14597883 ]
Cao Manh Dat commented on LUCENE-6595:
--------------------------------------

Thanks [~mikemccand].

I am quite confused about the finalOffset of a Tokenizer. For example:
{code}
Input  : ABC)))
Output : ABC
{code}
The end offset of the last term is 3. So should finalOffset be 3 or 6?

> CharFilter offsets correction is wonky
> --------------------------------------
>
>                 Key: LUCENE-6595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6595
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>         Attachments: LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue:
> https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
> ( ->
> ) ->
> {noformat}
> i.e., just erase left and right paren, then tokenizing the string
> "(F31)" with e.g. WhitespaceTokenizer, produces a single token F31,
> with start offset 1 (good).
> But for its end offset, I would expect/want 4, but it produces 5
> today.
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
> Output offset: 0 1 2 3
>  Input offset: 1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1). It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which creates startOffset=1 (good) and
> endOffset=5 (bad).
> At first, to fix this, I thought this is an "off-by-1" and when
> correcting the endOffset we really should return
> 1+correct(outputEndOffset-1), which would return the correct value (4)
> here.
> But that's too naive, e.g. here's another example:
> {noformat}
> cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4)
> but if we do this "off-by-1" fix for endOffset, we would get the wrong
> endOffset (2).
> I'm not sure what to do here...
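
For readers following along, the two cases in the quoted description can be reproduced with a toy model of the cumulative offset correction. This is only a sketch and not Lucene's actual code: the class OffsetCorrectionSketch, its methods, and the rule "use the cumulative diff recorded at the largest output offset not greater than the queried one" are assumptions, chosen so that the model matches the offset table and the results quoted in the issue.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of cumulative offset correction; hypothetical, not Lucene's implementation. */
public class OffsetCorrectionSketch {
    // Key: output offset. Value: cumulative (input - output) difference
    // that applies from that output offset onward.
    private final TreeMap<Integer, Integer> cumulativeDiffs = new TreeMap<>();

    public void addCorrection(int outputOffset, int cumulativeDiff) {
        cumulativeDiffs.put(outputOffset, cumulativeDiff);
    }

    /** Maps an output offset back to an input offset (the behaviour described as "today"). */
    public int correct(int outputOffset) {
        Map.Entry<Integer, Integer> e = cumulativeDiffs.floorEntry(outputOffset);
        return e == null ? outputOffset : outputOffset + e.getValue();
    }

    /** The naive "off-by-1" variant for end offsets: 1 + correct(outputEndOffset - 1). */
    public int correctEndOffByOne(int outputEndOffset) {
        return 1 + correct(outputEndOffset - 1);
    }

    /** "(F31)" with both parens erased: output offsets 0 1 2 3 map to input 1 2 3 5. */
    public static OffsetCorrectionSketch parenExample() {
        OffsetCorrectionSketch s = new OffsetCorrectionSketch();
        s.addCorrection(0, 1); // leading "(" removed
        s.addCorrection(3, 2); // trailing ")" removed as well
        return s;
    }

    /** "cccc" mapped to "cc": two input chars are gone once the output "cc" ends. */
    public static OffsetCorrectionSketch ccccExample() {
        OffsetCorrectionSketch s = new OffsetCorrectionSketch();
        s.addCorrection(2, 2);
        return s;
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch paren = parenExample();
        // Token F31 has output offsets (0, 3).
        System.out.println(paren.correct(0));            // 1 (good)
        System.out.println(paren.correct(3));            // 5 (one past the desired 4)
        System.out.println(paren.correctEndOffByOne(3)); // 4 (desired)

        OffsetCorrectionSketch cccc = ccccExample();
        // Token cc has output offsets (0, 2).
        System.out.println(cccc.correct(0));             // 0
        System.out.println(cccc.correct(2));             // 4 (correct)
        System.out.println(cccc.correctEndOffByOne(2));  // 2 (wrong)
    }
}
```

Under these assumptions neither rule gives the wanted end offset in both cases: plain correct() overshoots when characters are deleted after the token, while the off-by-1 variant undershoots when a mapping shrinks the token itself.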