[ 
https://issues.apache.org/jira/browse/LUCENE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792993#comment-13792993
 ] 

Robert Muir commented on LUCENE-5278:
-------------------------------------

I think I understand what you want, and it makes sense. The only reason it's the 
way it is today is that this thing historically came from CharTokenizer (see 
the isTokenChar?).

But it would be better if you could, e.g., make a pattern like ([A-Z][a-z]+) and 
have it actually break FooBar into Foo, Bar rather than throwing out "bar" 
altogether.
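
To make the difference concrete, here is a rough, self-contained sketch. It is 
not the real MockTokenizer code: the class name PushbackSketch, the hand-rolled 
state machine for [A-Z][a-z]+, and the pushBack flag are all made up for 
illustration. It only models the two behaviours: dropping the character that 
ended a token versus retrying it as the start of the next token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; NOT Lucene's MockTokenizer implementation.
public class PushbackSketch {

    // Hand-rolled state machine for [A-Z][a-z]+ (ASCII only).
    // State 0 = start, 1 = saw the uppercase letter, 2 = accepting.
    // Returns -1 when the character cannot extend the current match.
    static int step(int state, char c) {
        boolean upper = c >= 'A' && c <= 'Z';
        boolean lower = c >= 'a' && c <= 'z';
        switch (state) {
            case 0:  return upper ? 1 : -1;
            case 1:
            case 2:  return lower ? 2 : -1;
            default: return -1;
        }
    }

    // pushBack == false: the character that ended a token is consumed and lost.
    // pushBack == true:  that character is retried as a possible token start.
    static List<String> tokenize(String input, boolean pushBack) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int state = step(0, input.charAt(i));
            if (state < 0) {
                i++; // not a valid token start, skip it
                continue;
            }
            int start = i++;
            while (i < input.length()) {
                int next = step(state, input.charAt(i));
                if (next < 0) {
                    break; // input.charAt(i) is the "rejected" character
                }
                state = next;
                i++;
            }
            // Simplification in this sketch: only emit complete matches.
            if (state == 2) {
                tokens.add(input.substring(start, i));
            }
            if (!pushBack) {
                i++; // throw away the rejected character
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("FooBar", false)); // prints [Foo]
        System.out.println(tokenize("FooBar", true));  // prints [Foo, Bar]
    }
}
```

Without the push-back, the "B" is consumed when it ends "Foo", and the 
remaining "ar" can never start a token, so the second word is lost.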

I'll dig into this!

> MockTokenizer throws away the character right after a token even if it is a 
> valid start to a new token
> ------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5278
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Nik Everett
>            Priority: Trivial
>         Attachments: LUCENE-5278.patch
>
>
> MockTokenizer throws away the character right after a token even if it is a 
> valid start to a new token.  You won't see this unless you build a tokenizer 
> that can recognize every character like with new RegExp(".") or RegExp("...").
> Changing this behaviour seems to break a number of tests.



