If I understand correctly, the user dictionary in the JapaneseTokenizer
lets users customize how the input stream is segmented into tokens via
rules of the form:
AABBBCC -> AA BBB CC
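
For reference (assuming this is Lucene's Kuromoji tokenizer, where such rules live in a CSV user dictionary), a concrete rule of that shape looks like:

日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞

i.e. surface form, space-separated segmentation, space-separated readings, and a part-of-speech tag.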

It does not let users change any of the characters, like:

AABBBCC -> DD BBB CC   (this just tokenizes to "AA", "BBB", "CC"; the
dictionary seems to only care about segment positions, not content)

It also doesn't let a character belong to more than one token, like:

AABBBCC -> AAB BBB BCC (this throws an ArrayIndexOutOfBoundsException)

...or make the output tokens longer than the input text:

AA -> AAA (also an ArrayIndexOutOfBoundsException)

Is this the expected behavior? Maybe cases 2-4 should be handled by adding
filters instead. If so, are there any cases where the user dictionary
should accept a tokenization where the original text differs from the
concatenation of the tokens?

Tomás
