Thanks Christian. I don't have a different use case, but if what I described is the expected behavior, I think we should validate the user dictionary at create time (and only accept proper tokenizations) instead of failing later when the tokenizer is used. If you agree, I'll create a Jira for that.
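To make the proposal concrete, here is a minimal sketch of the kind of check I mean (the class and method names are made up, not existing Lucene API, and I'm assuming the usual surface,segmentation,readings,part-of-speech CSV layout of the user dictionary file):

    // A sketch only: UserDictionaryValidator/validateEntry are hypothetical
    // names. Assumes the Kuromoji user-dictionary CSV layout:
    // surface,segmentation,readings,part-of-speech.
    public class UserDictionaryValidator {

      // Naive comma split is enough for the sketch; real code would reuse
      // the dictionary's own CSV parsing.
      static void validateEntry(String csvLine) {
        String[] fields = csvLine.split(",");
        String surface = fields[0];
        String segmentation = fields[1];
        // An entry is well-formed only if its segments reassemble into the
        // surface form. This rejects, at load time, all three broken cases
        // from the thread: substituted characters (AABBBCC -> DD BBB CC),
        // overlapping tokens (AABBBCC -> AAB BBB BCC) and output longer
        // than the input (AA -> AAA), which today either tokenize silently
        // wrong or throw an AIOOBE inside the tokenizer.
        if (!segmentation.replace(" ", "").equals(surface)) {
          throw new IllegalArgumentException(
              "Entry \"" + surface + "\": segmentation \"" + segmentation
                  + "\" does not reassemble into the surface form");
        }
      }

      public static void main(String[] args) {
        validateEntry("関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞"); // OK
        validateEntry("AABBBCC,AAB BBB BCC,r1 r2 r3,pos"); // fails fast at load time
      }
    }

Failing fast with a message like that seems friendlier than an ArrayIndexOutOfBoundsException deep inside the tokenizer.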
Thanks,

Tomás

On Mon, Apr 4, 2016 at 10:05 PM, Christian Moen <c...@atilika.com> wrote:
> Hello Tomás,
>
> What you are describing is the expected behaviour. If you have any
> specific use cases that motivate how this perhaps should be changed, I'm
> very happy to learn more about them to see how we can improve things.
>
> Many thanks,
>
> Christian Moen
> アティリカ株式会社
> https://www.atilika.com
>
>
> On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
> >
> > If I understand correctly, the user dictionary in the JapaneseTokenizer
> > allows users to customize how a stream is broken into tokens using a
> > specific set of rules, provided like:
> >
> > AABBBCC -> AA BBB CC
> >
> > It does not allow users to change any of the characters, like:
> >
> > AABBBCC -> DD BBB CC (this will just tokenize to "AA", "BBB", "CC";
> > it seems to only care about positions)
> >
> > It also doesn't let a character be part of more than one token, like:
> >
> > AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> >
> > ...or make the output token bigger than the input text:
> >
> > AA -> AAA (also AIOOBE)
> >
> > Is this the expected behavior? Maybe cases 2-4 should be handled by
> > adding filters then. If so, are there any cases where the user dictionary
> > should accept a tokenization where the original text is different from the
> > sum of the tokens?
> >
> > Tomás