[ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3940:
---------------------------------------

    Attachment: LUCENE-3940.patch

New patch, fixing a bug in the last one, and adding a few more test cases. I also made the "print curious string on exception" from BTSTC more ascii-friendly.

I think it's ready.

> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3940
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3940
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-3940.patch, LUCENE-3940.patch
>
>
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
>
> In this case we should leave hole(s) so that the graph is intact, ie,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
>
> This also affects tokens that have no compound over them, ie we fail
> to leave a hole today when we remove the punctuation tokens.
>
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...
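
For readers unfamiliar with what "leave a hole" means here, the following is a rough sketch of the idea in plain Python. It is not the Kuromoji code and not the attached patch; the function names (`drop_without_holes`, `drop_leaving_holes`, `absolute_positions`) and the `(term, position_increment)` tuple model are assumptions made purely for illustration. A hole is modelled as a position increment greater than 1 on the token that follows a removed one, which matches what the description says a StopFilter would have produced.

```python
import unicodedata


def is_punctuation(term):
    # Treat a non-empty term as punctuation when every character is in a
    # Unicode "P*" general category (illustrative rule, not Kuromoji's).
    return bool(term) and all(
        unicodedata.category(ch).startswith("P") for ch in term
    )


def drop_without_holes(terms):
    # Behaviour the issue reports: punctuation vanishes and every surviving
    # token advances the position by exactly 1.
    return [(t, 1) for t in terms if not is_punctuation(t)]


def drop_leaving_holes(terms):
    # Behaviour the issue asks for: each removed token is folded into the
    # position increment of the next surviving token.
    out = []
    skipped = 0
    for t in terms:
        if is_punctuation(t):
            skipped += 1
            continue
        out.append((t, 1 + skipped))
        skipped = 0
    return out


def absolute_positions(tokens):
    # Convert (term, increment) pairs to (term, absolute position) pairs,
    # starting so that a first increment of 1 lands on position 0.
    pos = -1
    result = []
    for term, inc in tokens:
        pos += inc
        result.append((term, pos))
    return result


terms = ["東京", "、", "大阪"]
print(absolute_positions(drop_without_holes(terms)))  # [('東京', 0), ('大阪', 1)]
print(absolute_positions(drop_leaving_holes(terms)))  # [('東京', 0), ('大阪', 2)]
```

With the hole preserved, the second term keeps the position it would have had if the punctuation token had been emitted and then filtered out, so anything that depends on positions sees the same graph either way.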