Hello again Tomás,

Thanks. I agree entirely. If you open a JIRA, I'll have a look and make improvements.

Best regards,

Christian Moen
アティリカ株式会社
https://www.atilika.com

> On Apr 5, 2016, at 15:12, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
>
> Thanks Christian,
> I don't have a different use case, but if what I said is the expected
> behavior, I think we should validate the User Dictionary at create time (and
> allow only proper tokenization) instead of breaking when using the tokenizer.
> If you agree I'll create a Jira for that.
>
> Thanks,
>
> Tomás
>
> On Mon, Apr 4, 2016 at 10:05 PM, Christian Moen <c...@atilika.com> wrote:
> Hello Tomás,
>
> What you are describing is the expected behaviour. If you have any specific
> use cases that motivate how this perhaps should be changed, I'm very happy
> to learn more about them to see how we can improve things.
>
> Many thanks,
>
> Christian Moen
> アティリカ株式会社
> https://www.atilika.com
>
> > On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
> >
> > If I understand correctly, the user dictionary in the JapaneseTokenizer
> > allows users to customize how a stream is broken into tokens using a
> > specific set of rules provided like:
> > AABBBCC -> AA BBB CC
> >
> > It does not allow users to change any of the characters like:
> >
> > AABBBCC -> DD BBB CC (this will just tokenize to "AA", "BBB", "CC", seems
> > to only care about positions)
> >
> > It also doesn't let a character be part of more than one token, like:
> >
> > AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> >
> > ..or make the output token bigger than the input text:
> >
> > AA -> AAA (Also AIOOBE)
> >
> > Is this the expected behavior? Maybe cases 2-4 should be handled by adding
> > filters then. If so, are there any cases where the user dictionary should
> > accept a tokenization where the original text differs from the sum of
> > the tokens?
> >
> > Tomás
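
The create-time validation proposed in the thread could look roughly like the sketch below. It assumes a user dictionary entry supplies a surface form and a space-separated segmentation, and it accepts the entry only when the segments, concatenated in order, reproduce the surface form exactly. Under that assumption the three problem cases from the thread (changed characters, overlapping tokens, output longer than the input) are all rejected before the tokenizer ever runs. The class and method names are hypothetical and are not part of Lucene; this is not a description of how Lucene performs or will perform the check.

```java
// Hypothetical helper, not a Lucene class: a minimal sketch of validating
// a user dictionary entry when the dictionary is built.
final class UserDictionaryEntryCheck {

    private UserDictionaryEntryCheck() {}

    /**
     * Returns true when the space-separated segments, concatenated in order,
     * are exactly equal to the surface form. Empty surface forms and empty
     * segmentations are treated as invalid.
     */
    static boolean isValidSegmentation(String surface, String segmentation) {
        String trimmed = segmentation.trim();
        if (surface.isEmpty() || trimmed.isEmpty()) {
            return false;
        }
        StringBuilder joined = new StringBuilder();
        // Split on runs of whitespace so repeated spaces do not create empty segments.
        for (String segment : trimmed.split("\\s+")) {
            joined.append(segment);
        }
        return joined.toString().equals(surface);
    }
}
```

With this check, `isValidSegmentation("AABBBCC", "AA BBB CC")` is accepted, while `"DD BBB CC"`, `"AAB BBB BCC"` and `"AA" -> "AAA"` are rejected, so a bad entry would surface as a clear error at dictionary load time rather than as an AIOOBE during tokenization.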