I used to work at a Japanese e-commerce company, and they had over
200,000 user dictionary entries. I am not sure whether they still use such
a user dictionary, but the tokenizer and char/token filters alone cannot
handle many writing variations. So this is an important feature for
Japanese handling.
Be
Hello Bruno,
It's an important and commonly used feature. Feel free to chime in on the
improvements you have in mind. Thanks.
Best,
Christian
On Sat, May 18, 2024 at 9:40 PM Bruno Roustant
wrote:
> Hi,
>
> While looking at the various usages of Map with Integer keys, I found
> ja.dict.UserD
I worked at a couple of search engine vendors (Infoseek Ultraseek and
MarkLogic), and user dictionaries are important for linguistic processing.
Every application has some local jargon.
With languages that don’t separate words with spaces (Chinese and Japanese),
the tokenizer needs the user dic
We use it at Amazon. I can't really read it so I'm not sure, but I think
it's used to encode terms that come up that aren't handled well by the
standard dictionary.
On Sat, May 18, 2024 at 8:39 AM Bruno Roustant wrote:
>
> Hi,
>
> While looking at the various usages of Map with Integer keys, I found
Hi,
While looking at the various usages of Map with Integer keys, I found
ja.dict.UserDictionary with its lookup() method where there is a *TODO: can
we avoid this treemap/toIndexArray?*
I could propose something, but I would like to know how much it is used,
and if it is worth improving it.
Tha
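
For readers unfamiliar with the TODO quoted above, here is a minimal sketch of
the general pattern it seems to refer to: collecting matches in a TreeMap keyed
by start offset and then converting them to an array, compared with appending
to a plain list. This is not the Lucene code and not necessarily the
improvement Bruno has in mind. The class LookupSketch, the toy ASCII
dictionary ENTRIES, and both lookup methods are hypothetical names made up for
illustration; the real lookup() internals are not reproduced here. The sketch
only assumes that start offsets are scanned in increasing order, in which case
a list is already sorted and the sorted map adds nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class LookupSketch {
  // Hypothetical toy dictionary of user-defined surface forms (ASCII for brevity).
  static final Set<String> ENTRIES = Set.of("tokyoskytree", "sky");

  // Length of the longest dictionary entry starting at 'start', or 0 if none.
  static int longestMatch(String text, int start) {
    int best = 0;
    for (int end = start + 1; end <= text.length(); end++) {
      if (ENTRIES.contains(text.substring(start, end))) {
        best = end - start;
      }
    }
    return best;
  }

  // Variant A: collect into a TreeMap keyed by start offset, then flatten.
  static int[][] lookupWithTreeMap(String text) {
    TreeMap<Integer, Integer> matches = new TreeMap<>();
    for (int start = 0; start < text.length(); start++) {
      int len = longestMatch(text, start);
      if (len > 0) {
        matches.put(start, len);
      }
    }
    int[][] out = new int[matches.size()][];
    int i = 0;
    for (Map.Entry<Integer, Integer> e : matches.entrySet()) {
      out[i++] = new int[] {e.getKey(), e.getValue()};
    }
    return out;
  }

  // Variant B: offsets are visited in ascending order, so appending to a
  // list yields the same ordering without a sorted map or boxed keys.
  static int[][] lookupWithList(String text) {
    List<int[]> matches = new ArrayList<>();
    for (int start = 0; start < text.length(); start++) {
      int len = longestMatch(text, start);
      if (len > 0) {
        matches.add(new int[] {start, len});
      }
    }
    return matches.toArray(new int[0][]);
  }
}
```

Both variants return rows of {startOffset, matchLength}. Whether the same
simplification applies to the actual UserDictionary code depends on details
that this thread does not show.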