David Smiley created LUCENE-5734:
------------------------------------

Summary: HTMLStripCharFilter end offset should be left of closing tags
Key: LUCENE-5734
URL: https://issues.apache.org/jira/browse/LUCENE-5734
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Reporter: David Smiley
Priority: Minor

Consider this simple input:
{noformat}
<em>hello</em>
{noformat}
to be analyzed by HTMLStripCharFilter and WhitespaceTokenizer. You get back one token for "hello". Good. The start offset of this token is at the position of 'h', which is also good. But the end offset is, surprisingly, one past the adjacent </em>. I argue that it should be one past the last character of the token (following 'o').

FYI it behaves as I expect if after hello is an -- the end offset immediately follows the 'o'.
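
To make the two candidate end offsets concrete, here is a minimal, self-contained sketch. It is a toy model written for this illustration, not Lucene's HTMLStripCharFilter or its offset-correction code: the class OffsetSketch and its methods are hypothetical names, and the tag stripping is deliberately naive (anything between '<' and '>' is dropped). It only shows how the same token end can map back either to the right of the removed closing tag (the behavior reported above) or to the left of it (the behavior argued for).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Lucene's HTMLStripCharFilter. All names here are hypothetical.
class OffsetSketch {
    final String stripped;   // input with naive <...> tags removed
    final int[] origIndex;   // origIndex[i] = position in the input of stripped.charAt(i)
    final int inputLength;

    OffsetSketch(String input) {
        StringBuilder out = new StringBuilder();
        List<Integer> map = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;                 // start skipping markup
            } else if (c == '>' && inTag) {
                inTag = false;                // markup ends, resume copying
            } else if (!inTag) {
                out.append(c);                // keep text and remember where it came from
                map.add(i);
            }
        }
        stripped = out.toString();
        origIndex = new int[map.size()];
        for (int i = 0; i < origIndex.length; i++) {
            origIndex[i] = map.get(i);
        }
        inputLength = input.length();
    }

    // Start offset: original position of the token's first character.
    int startOffset(int start) {
        return origIndex[start];
    }

    // End offset placed to the RIGHT of any markup removed after the token
    // (models the behavior reported in this issue).
    int endOffsetAfterMarkup(int end) {
        return end < stripped.length() ? origIndex[end] : inputLength;
    }

    // End offset placed to the LEFT of removed markup: one past the token's
    // last character (the behavior argued for in this issue).
    int endOffsetBeforeMarkup(int end) {
        return end == 0 ? 0 : origIndex[end - 1] + 1;
    }

    public static void main(String[] args) {
        String input = "<em>hello</em>";
        OffsetSketch s = new OffsetSketch(input);
        int tokenEnd = s.stripped.length();   // the single whitespace-delimited token "hello"
        System.out.println("start        = " + s.startOffset(0));                 // 4
        System.out.println("end (right)  = " + s.endOffsetAfterMarkup(tokenEnd)); // 14
        System.out.println("end (left)   = " + s.endOffsetBeforeMarkup(tokenEnd)); // 9
        // Highlighting with the left-biased end covers only the word itself.
        System.out.println(input.substring(4, s.endOffsetBeforeMarkup(tokenEnd))); // hello
    }
}
```

With the right-biased end, a highlighter working on the original text would wrap "hello</em>"; with the left-biased end it wraps just "hello", which is why the distinction matters for this input.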