[ 
https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243785#comment-13243785
 ] 

Michael McCandless commented on LUCENE-3940:
--------------------------------------------

OK here's one possible fix...

Right now, when we are glomming up an UNKNOWN token, we glom only as long as 
the character class of each character is the same as the first character.

What if we also require that isPunct-ness is the same?  That way we would never 
create an UNKNOWN token mixing punct and non-punct...

I implemented that and the tests seem to pass w/ offset checking fully turned 
on again...
                
> When Japanese (Kuromoji) tokenizer removes a punctuation token it should 
> leave a hole
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3940
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3940
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
>
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail
> to leave a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to