[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595289#comment-14595289 ]
Cao Manh Dat commented on LUCENE-6595:
--------------------------------------

Currently CharFilter has two problems.

Problem 1:
{code}
Input  : A B C ) ) )
Output : A B C
{code}
When the Tokenizer asks to correct offset 3 (which is C in the output), that offset maps back to offsets 3, 4, 5, 6 in the input. CharFilter corrects the offset of C to 6 (the end of the range). So why does cccc -> cc produce a correct offset?
{code}
Input  : c c c c
Output : c c
{code}
Because offset 2 (the second c in the output) maps back to offsets 2, 3, 4 in the input, CharFilter corrects offset 2 to 4 (the end of the range, which is correct). The difference between the two examples: in Ex1 the replacement happens exactly at the correction point (at 3), while in Ex2 the replacement happens before the correction point (at 0). So I store an inputOffsets[] array holding the start offset of each replacement.

Problem 2:
{code}
Input  : A <space> ( C
Output : A <space> C
{code}
When the Tokenizer asks to correct offset 3 (which is C in the output), that offset maps back to offsets 3, 4 in the input. CharFilter corrects the offset of C to 4 (the end of the range, which is correct). But in this example the replacement also happens exactly at the correction point. So correcting a startOffset and correcting an endOffset are not the same operation.

The root of both problems is that we map N -> 1 and then ask for an inverse 1 -> 1 mapping.

[~dsmiley] I will look at LUCENE-5734 and try to fix that bug.

> CharFilter offsets correction is wonky
> --------------------------------------
>
> Key: LUCENE-6595
> URL: https://issues.apache.org/jira/browse/LUCENE-6595
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Attachments: LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue:
> https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
> ( ->
> ) ->
> {noformat}
> i.e., just erase left and right paren, then tokenizing the string
> "(F31)" with e.g.
> WhitespaceTokenizer produces a single token, F31,
> with start offset 1 (good).
> But for its end offset, I would expect/want 4, yet it produces 5
> today.
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
> Output offset: 0 1 2 3
> Input offset:  1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1). It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which produces startOffset=1 (good) and
> endOffset=5 (bad).
> At first, to fix this, I thought this was an "off-by-1" and that when
> correcting the endOffset we really should return
> 1+correct(outputEndOffset-1), which would return the correct value (4)
> here.
> But that's too naive; e.g., here's another example:
> {noformat}
> cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4),
> but if we do this "off-by-1" fix for endOffset, we would get the wrong
> endOffset (2).
> I'm not sure what to do here...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
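The conceptual output-to-input offset tables from the issue can be simulated outside Lucene. This is a minimal sketch, not actual CharFilter code: the {{correct}} method and the hard-coded arrays are illustrative stand-ins for the (more compactly encoded) mapping the issue describes. It shows why the "off-by-1" fix repairs the "(F31)" case but breaks the cccc -> cc case.

{code}
// Illustrative simulation of CharFilter offset correction (NOT Lucene code).
public class OffsetCorrectionDemo {

    // Map an output offset back to an input offset via a lookup table,
    // standing in for CharFilter.correctOffset().
    static int correct(int[] inputOffsets, int outputOffset) {
        return inputOffsets[outputOffset];
    }

    public static void main(String[] args) {
        // Example 1: "(F31)" with '(' and ')' erased -> output "F31".
        //   Output offset: 0 1 2 3
        //   Input  offset: 1 2 3 5
        int[] parens = {1, 2, 3, 5};
        int start = correct(parens, 0);          // 1 (good)
        int end   = correct(parens, 3);          // 5 (bad; 4 is expected)
        int naive = 1 + correct(parens, 3 - 1);  // 4 (the "off-by-1" fix works here)
        System.out.println(start + " " + end + " " + naive);

        // Example 2: cccc -> cc, tokenizing "cccc".
        //   Output offset: 0 1 2
        //   Input  offset: 0 1 4
        int[] cs = {0, 1, 4};
        int endToday = correct(cs, 2);          // 4 (already correct today)
        int naiveBad = 1 + correct(cs, 2 - 1);  // 2 (the same fix now breaks it)
        System.out.println(endToday + " " + naiveBad);
    }
}
{code}

Both cases go through the same 1 -> 1 inverse lookup, which is why no single end-offset rule fixes both: the underlying mapping was N -> 1, as the comment above notes.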