Re: Japanese Tokenizer using User Dictionary

Christian Moen Mon, 04 Apr 2016 23:38:44 -0700

Hello again Tomás,

Thanks.  I agree entirely.  If you open a JIRA, I'll have a look and make 
improvements.


Best regards,

Christian Moen
アティリカ株式会社
https://www.atilika.com

> On Apr 5, 2016, at 15:12, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
> 
> Thanks Christian, 
> I don't have a different use case, but if what I said is the expected 
> behavior, I think we should validate the User Dictionary at create time (and 
> allow only proper tokenization) instead of breaking when using the tokenizer. 
> If you agree I'll create a Jira for that.
> 
> Thanks, 
> 
> Tomás
> 
> On Mon, Apr 4, 2016 at 10:05 PM, Christian Moen <c...@atilika.com> wrote:
> Hello Tomás,
> 
> What you are describing is the expected behaviour.  If you have any specific 
> use cases that motivate how this perhaps should be changed, I'm very happy 
> to learn more about them to see how we can improve things.
> 
> Many thanks,
> 
> Christian Moen
> アティリカ株式会社
> https://www.atilika.com
> 
> > On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
> >
> > If I understand correctly, the user dictionary in the JapaneseTokenizer 
> > allows users to customize how a stream is broken into tokens using a 
> > specific set of rules provided like:
> > AABBBCC -> AA BBB CC
> >
> > It does not allow users to change any of the characters like:
> >
> > AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC", seems 
> > to only care about positions)
> >
> > It also doesn't let a character be part of more than one token, like:
> >
> > AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> >
> > ..or make the output token bigger than the input text:
> >
> > AA -> AAA (Also AIOOBE)
> >
> > Is this the expected behavior? Maybe cases 2-4 should be handled by adding 
> > filters then. If so, are there any cases where the user dictionary should 
> > accept a tokenization where the original text is different from the sum of 
> > the tokens?
> >
> > Tomás
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 
>
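
The create-time validation discussed in this thread could be sketched
roughly as below. This is only an illustration under stated assumptions,
not code from Lucene: the class UserDictionaryEntryValidator and its
methods are hypothetical names, and the sketch assumes a user dictionary
entry supplies a surface form plus a space-separated segmentation. It
accepts an entry only when the segments, concatenated in order, reproduce
the surface form exactly. That is stricter than the behaviour described
above, where "AABBBCC -> DD BBB CC" is silently tokenized by position;
here that entry would be rejected along with the two AIOOBE cases.

```java
/** Hypothetical helper (not part of Lucene): validates one user dictionary entry. */
final class UserDictionaryEntryValidator {

  private UserDictionaryEntryValidator() {}

  /**
   * Returns true when the space-separated segments, joined in order,
   * are exactly equal to the surface form.
   */
  static boolean isValidSegmentation(String surface, String segmentation) {
    String[] segments = segmentation.trim().split("\\s+");
    StringBuilder joined = new StringBuilder();
    for (String segment : segments) {
      joined.append(segment);
    }
    return joined.toString().equals(surface);
  }

  /**
   * Fails fast with a descriptive message, so a bad entry is reported when
   * the dictionary is built rather than while tokenizing.
   */
  static void validate(String surface, String segmentation) {
    if (!isValidSegmentation(surface, segmentation)) {
      throw new IllegalArgumentException(
          "Illegal user dictionary entry '" + surface
              + "': the segmentation '" + segmentation
              + "' does not concatenate to the surface form");
    }
  }
}
```

With this check, "AABBBCC" with "AA BBB CC" passes, while "AABBBCC" with
"AAB BBB BCC" and "AA" with "AAA" both raise IllegalArgumentException
instead of an ArrayIndexOutOfBoundsException at tokenization time.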
