When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave 
a hole
-------------------------------------------------------------------------------------

                 Key: LUCENE-3940
                 URL: https://issues.apache.org/jira/browse/LUCENE-3940
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 4.0


I modified BaseTokenStreamTestCase to assert that the start/end
offsets match for graph (posLen > 1) tokens, and this caught a bug in
Kuromoji when the decompounding of a compound token has a punctuation
token that's dropped.

In this case we should leave hole(s) so that the graph is intact, ie,
the graph should look the same as if the punctuation tokens were not
initially removed, but then a StopFilter had removed them.

This also affects tokens that have no compound over them, ie we fail
to leave a hole today when we remove the punctuation tokens.

I'm not sure this is serious enough to warrant fixing in 3.6 at the
last minute...


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to