Thanks Christian, I created https://issues.apache.org/jira/browse/LUCENE-7181
On Mon, Apr 4, 2016 at 11:38 PM, Christian Moen <c...@atilika.com> wrote:
> Hello again Tomás,
>
> Thanks. I agree entirely. If you open a JIRA, I'll have a look and
> make improvements.
>
> Best regards,
>
> Christian Moen
> アティリカ株式会社 (Atilika Inc.)
> https://www.atilika.com
>
> On Apr 5, 2016, at 15:12, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
>
> Thanks Christian,
> I don't have a different use case, but if what I said is the expected
> behavior, I think we should validate the user dictionary at creation time
> (and allow only proper tokenizations) instead of breaking when using the
> tokenizer.
> If you agree, I'll create a JIRA for that.
>
> Thanks,
>
> Tomás
>
> On Mon, Apr 4, 2016 at 10:05 PM, Christian Moen <c...@atilika.com> wrote:
>
>> Hello Tomás,
>>
>> What you are describing is the expected behaviour. If you have any
>> specific use cases that motivate how this perhaps should be changed, I'm
>> very happy to learn more about them to see how we can improve things.
>>
>> Many thanks,
>>
>> Christian Moen
>> アティリカ株式会社 (Atilika Inc.)
>> https://www.atilika.com
>>
>> > On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
>> >
>> > If I understand correctly, the user dictionary in the JapaneseTokenizer
>> > allows users to customize how a stream is broken into tokens using a
>> > specific set of rules, like:
>> >
>> > AABBBCC -> AA BBB CC
>> >
>> > It does not allow users to change any of the characters, like:
>> >
>> > AABBBCC -> DD BBB CC (this will just tokenize to "AA", "BBB", "CC";
>> > it seems to only care about positions)
>> >
>> > It also doesn't let a character be part of more than one token, like:
>> >
>> > AABBBCC -> AAB BBB BCC (this will throw an ArrayIndexOutOfBoundsException)
>> >
>> > ...or make the output tokens bigger than the input text:
>> >
>> > AA -> AAA (also an ArrayIndexOutOfBoundsException)
>> >
>> > Is this the expected behavior? Maybe cases 2-4 should be handled by
>> > adding filters then. If so, are there any cases where the user dictionary
>> > should accept a tokenization where the original text is different from the
>> > sum of the tokens?
>> >
>> > Tomás
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
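The create-time validation proposed here (and in LUCENE-7181) boils down to one invariant: the segments of a user-dictionary rule must concatenate back to the surface form exactly, so every input character belongs to exactly one token. A rough standalone sketch of that check (hypothetical helper, not actual Lucene code; the class and method names are made up for illustration):

```java
// Hypothetical sketch of the validation proposed in LUCENE-7181,
// not part of the Lucene/Kuromoji codebase. A rule is well-formed
// only if its segmentation concatenates exactly to the surface form.
public class UserDictRuleCheck {

    // Returns true if the whitespace-separated segments concatenate
    // exactly to the surface string. Overlapping segmentations like
    // "AABBBCC -> AAB BBB BCC" and over-long ones like "AA -> AAA"
    // fail here, instead of surfacing later as an
    // ArrayIndexOutOfBoundsException inside the tokenizer.
    public static boolean isValidRule(String surface, String segmentation) {
        StringBuilder joined = new StringBuilder();
        for (String segment : segmentation.trim().split("\\s+")) {
            joined.append(segment);
        }
        return joined.toString().equals(surface);
    }
}
```

With this check, cases 2-4 from the first message are all rejected at dictionary-build time: `isValidRule("AABBBCC", "AA BBB CC")` is true, while `"AAB BBB BCC"`, `"AAA"` (for surface `"AA"`), and `"DD BBB CC"` are all false.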