Thanks Christian. I don't have a different use case, but if what I described is the expected behavior, I think we should validate the user dictionary at create time (and only accept proper tokenizations) instead of failing later when the tokenizer is used. If you agree, I'll create a Jira for that.
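To make the proposal concrete, here is a minimal sketch of the kind of check I mean (the class and method names are made up, not existing Lucene API, and I'm assuming the usual surface,segmentation,readings,part-of-speech CSV layout of the user dictionary file):

    // A sketch only: UserDictionaryValidator/validateEntry are hypothetical
    // names. Assumes the Kuromoji user-dictionary CSV layout:
    // surface,segmentation,readings,part-of-speech.
    public class UserDictionaryValidator {

      // Naive comma split is enough for the sketch; real code would reuse
      // the dictionary's own CSV parsing.
      static void validateEntry(String csvLine) {
        String[] fields = csvLine.split(",");
        String surface = fields[0];
        String segmentation = fields[1];
        // An entry is well-formed only if its segments reassemble into the
        // surface form. This rejects, at load time, all three broken cases
        // from the thread: substituted characters (AABBBCC -> DD BBB CC),
        // overlapping tokens (AABBBCC -> AAB BBB BCC) and output longer
        // than the input (AA -> AAA), which today either tokenize silently
        // wrong or throw an AIOOBE inside the tokenizer.
        if (!segmentation.replace(" ", "").equals(surface)) {
          throw new IllegalArgumentException(
              "Entry \"" + surface + "\": segmentation \"" + segmentation
                  + "\" does not reassemble into the surface form");
        }
      }

      public static void main(String[] args) {
        validateEntry("関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞"); // OK
        validateEntry("AABBBCC,AAB BBB BCC,r1 r2 r3,pos"); // fails fast at load time
      }
    }

Failing fast with a message like that seems friendlier than an ArrayIndexOutOfBoundsException deep inside the tokenizer.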
Thanks,

Tomás

On Mon, Apr 4, 2016 at 10:05 PM, Christian Moen <c...@atilika.com> wrote:
> Hello Tomás,
>
> What you are describing is the expected behaviour. If you have any
> specific use cases that motivate how this perhaps should be changed, I'm
> very happy to learn more about them to see how we can improve things.
>
> Many thanks,
>
> Christian Moen
> アティリカ株式会社
> https://www.atilika.com
>
>
> On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
> >
> > If I understand correctly, the user dictionary in the JapaneseTokenizer
> > allows users to customize how a stream is broken into tokens using a
> > specific set of rules, provided like:
> >
> > AABBBCC -> AA BBB CC
> >
> > It does not allow users to change any of the characters, like:
> >
> > AABBBCC -> DD BBB CC (this will just tokenize to "AA", "BBB", "CC";
> > it seems to only care about positions)
> >
> > It also doesn't let a character be part of more than one token, like:
> >
> > AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> >
> > ...or make the output token bigger than the input text:
> >
> > AA -> AAA (also AIOOBE)
> >
> > Is this the expected behavior? Maybe cases 2-4 should be handled by
> > adding filters then. If so, are there any cases where the user dictionary
> > should accept a tokenization where the original text is different from the
> > sum of the tokens?
> >
> > Tomás