Hello Tomás,

What you are describing is the expected behaviour.  If you have any specific 
use cases that motivate changing this, I'm very happy to learn more about 
them to see how we can improve things.
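For reference, the invariant behind all the cases you list is that a user 
dictionary entry can only re-segment the surface form, never rewrite it, so 
the concatenated segments must reproduce the input exactly. A minimal sketch 
of that check (a hypothetical helper for illustration, not Lucene's actual 
UserDictionary code):

```python
def validate_user_dict_entry(surface: str, segmentation: list[str]) -> None:
    """Check that a segmentation is a pure split of the surface form.

    The JapaneseTokenizer user dictionary only re-segments text, so the
    concatenated segments must reproduce the surface string exactly.
    """
    if "".join(segmentation) != surface:
        raise ValueError(
            f"segments {segmentation!r} do not concatenate to {surface!r}; "
            "a user dictionary entry can only split the input, not rewrite it"
        )

# The cases from the thread below:
validate_user_dict_entry("AABBBCC", ["AA", "BBB", "CC"])       # OK: pure split
# validate_user_dict_entry("AABBBCC", ["DD", "BBB", "CC"])     # rejected: rewrites characters
# validate_user_dict_entry("AABBBCC", ["AAB", "BBB", "BCC"])   # rejected: overlapping segments
# validate_user_dict_entry("AA", ["AAA"])                      # rejected: output longer than input
```

Anything beyond re-segmentation (cases 2-4 below) falls outside this 
invariant, which is why a TokenFilter is the natural place for it.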

Many thanks,

Christian Moen
アティリカ株式会社
https://www.atilika.com

> On Apr 5, 2016, at 04:39, Tomás Fernández Löbbe <tomasflo...@gmail.com> wrote:
> 
> If I understand correctly, the user dictionary in the JapaneseTokenizer 
> allows users to customize how a stream is broken into tokens using a specific 
> set of rules provided like: 
> AABBBCC -> AA BBB CC
> 
> It does not allow users to change any of the characters like:
> 
> AABBBCC -> DD BBB CC   (this will just tokenize to "AA", "BBB", "CC", seems 
> to only care about positions)
> 
> It also doesn't let a character be part of more than one token, like:
> 
> AABBBCC -> AAB BBB BCC (this will throw an AIOOBE)
> 
> ..or make the output token bigger than the input text: 
> 
> AA -> AAA (Also AIOOBE)
> 
> Is this the expected behavior? Maybe cases 2-4 should be handled by adding 
> filters then. If so, are there any cases where the user dictionary should 
> accept a tokenization where the original text is different from the sum of 
> the tokens?
> 
> Tomás
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
