When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
-------------------------------------------------------------------------------------
Key: LUCENE-3940
URL: https://issues.apache.org/jira/browse/LUCENE-3940
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 4.0

I modified BaseTokenStreamTestCase to assert that the start/end offsets match for graph (posLen > 1) tokens, and this caught a bug in Kuromoji when the decompounding of a compound token has a punctuation token that's dropped.

In this case we should leave hole(s) so that the graph is intact, ie, the graph should look the same as if the punctuation tokens were not initially removed, but then a StopFilter had removed them.

This also affects tokens that have no compound over them, ie we fail to leave a hole today when we remove the punctuation tokens.

I'm not sure this is serious enough to warrant fixing in 3.6 at the last minute...
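
To make "leaving a hole" concrete, here is a minimal, self-contained sketch of the idea in plain Java. It does not use Lucene's API: the `Token` class, `isPunctuation`, and `dropPunctuation` are hypothetical names invented for this illustration, and the model tracks only position increments, ignoring offsets and position length (so it does not cover the compound/graph case). The point is only that a dropped token's position increment is carried over to the next surviving token, which is what one would see if a StopFilter had removed it.

```java
import java.util.ArrayList;
import java.util.List;

public class HoleSketch {
    /** Toy token: a term plus its position increment (1 = directly after the previous token). */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    /** Assumption for this sketch: a term with no letters or digits counts as punctuation. */
    static boolean isPunctuation(String term) {
        return term.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    /**
     * Drops punctuation tokens but leaves a hole: the increments of the
     * dropped tokens are added to the next token that is kept.
     */
    static List<Token> dropPunctuation(List<Token> in) {
        List<Token> out = new ArrayList<>();
        int skipped = 0; // position increments of tokens dropped since the last kept token
        for (Token t : in) {
            if (isPunctuation(t.term)) {
                skipped += t.posInc;
            } else {
                out.add(new Token(t.term, t.posInc + skipped));
                skipped = 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Romanized stand-ins for a tokenized Japanese phrase with a comma in the middle.
        List<Token> in = List.of(new Token("tokyo", 1), new Token(",", 1), new Token("osaka", 1));
        System.out.println(dropPunctuation(in)); // prints [tokyo(+1), osaka(+2)]
    }
}
```

Without the hole, the second surviving token would have an increment of 1 and would appear adjacent to the first, so a phrase query could match across the removed punctuation.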