[jira] [Updated] (LUCENE-5278) MockTokenizer throws away the character right after a token even if it is a valid start to a new token

Robert Muir (JIRA) Fri, 11 Oct 2013 17:49:53 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir updated LUCENE-5278:
--------------------------------

    Attachment: LUCENE-5278.patch

Nice patch Nik!

I think this is ready: I tweaked variable names and rearranged a few things 
(e.g. I use -1 instead of Integer so we aren't boxing, plus some other small 
changes).

I also added some unit tests.

The main reasons tests were failing with your original patch:
* reset() needed to clear the buffer variables.
* the state machine needed an extra check when emitting a token: e.g. if you 
make a regex of "..", but you send it "abcde", the tokens should be "ab", 
"cd", but not "e". So when we end on a partial match, we have to check that 
we are in an accept state.
* term-limit-exceeded is a special case (versus the last character being in a 
reject state).
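
To make the accept-state check concrete, here is a minimal standalone sketch. 
It is not the code from the patch and does not use Lucene's automaton classes: 
the class name, step(), isAccept() and tokenize() are all hypothetical, and the 
DFA is hard-coded for the regex "..". It only illustrates the -1 sentinel for 
the buffered character and the check on end of input. The term-limit case is 
left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical tokenizer loop, not Lucene's MockTokenizer.
class PushbackTokenizerSketch {

    // Hard-coded DFA for the regex "..": 0 -> 1 -> 2 (accept). -1 means reject.
    static int step(int state, char c) {
        return state < 2 ? state + 1 : -1;
    }

    static boolean isAccept(int state) {
        return state == 2;
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder term = new StringBuilder();
        int state = 0;
        int pos = 0;
        int buffered = -1; // -1 means "nothing buffered", so no Integer boxing

        while (true) {
            int ch;
            if (buffered != -1) {
                ch = buffered;          // replay the character kept from last time
                buffered = -1;
            } else if (pos < input.length()) {
                ch = input.charAt(pos++);
            } else {
                ch = -1;                // end of input
            }

            if (ch == -1) {
                // Ending on a partial match: emit only from an accept state.
                if (isAccept(state)) {
                    tokens.add(term.toString());
                }
                return tokens;
            }

            int next = step(state, (char) ch);
            if (next != -1) {
                term.append((char) ch);
                state = next;
            } else {
                if (isAccept(state)) {
                    tokens.add(term.toString());
                    buffered = ch;      // keep it: it may start the next token
                }
                // Otherwise the partial term and the character are discarded.
                term.setLength(0);
                state = 0;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abcde")); // prints [ab, cd]
    }
}
```

In this sketch the trailing "e" reaches state 1, which is not an accept state, 
so nothing is emitted for it. A real tokenizer would also have to clear 
`buffered` in reset(), as noted in the first bullet.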

> MockTokenizer throws away the character right after a token even if it is a 
> valid start to a new token
> ------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5278
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Nik Everett
>            Assignee: Robert Muir
>            Priority: Trivial
>         Attachments: LUCENE-5278.patch, LUCENE-5278.patch
>
>
> MockTokenizer throws away the character right after a token even if it is a 
> valid start to a new token.  You won't see this unless you build a tokenizer 
> that can recognize every character like with new RegExp(".") or RegExp("...").
> Changing this behaviour seems to break a number of tests.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

