Re: How much is ja.dict.UserDictionary used?

2024-05-21 Thread Kazuaki Hiraga
I worked at a Japanese e-commerce company before, and they used to have over 200,000 user dictionary entries. I am not sure whether they still use such a user dictionary, but the tokenizer and char/token filters alone cannot handle several writing variations. So this is an important feature for Japanese text handling.

Re: How much is ja.dict.UserDictionary used?

2024-05-21 Thread Christian Moen
Hello Bruno, It's an important and commonly used feature. Feel free to chime in on the improvements you have in mind. Thanks. Best, Christian On Sat, May 18, 2024 at 9:40 PM Bruno Roustant wrote: > Hi, > > While looking at the various usages of Map with Integer keys, I found >

Re: How much is ja.dict.UserDictionary used?

2024-05-21 Thread Walter Underwood
I worked at a couple of search engine vendors (Infoseek Ultraseek and MarkLogic), and user dictionaries are important for linguistic processing. Every application has some local jargon. With languages that don’t separate words with spaces (Chinese and Japanese), the tokenizer needs the user

Re: How much is ja.dict.UserDictionary used?

2024-05-18 Thread Michael Sokolov
We use it at Amazon. I can't really read it so I'm not sure, but I think it's used to encode terms that come up that aren't handled well by the standard dictionary. On Sat, May 18, 2024 at 8:39 AM Bruno Roustant wrote: > > Hi, > > While looking at the various usages of Map with Integer keys, I

How much is ja.dict.UserDictionary used?

2024-05-18 Thread Bruno Roustant
Hi, While looking at the various usages of Map with Integer keys, I found ja.dict.UserDictionary with its lookup() method where there is a *TODO: can we avoid this treemap/toIndexArray?* I could propose something, but I would like to know how much it is used, and if it is worth improving it.
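
To make the question concrete, here is a minimal sketch of the kind of change the TODO hints at. It is not the Lucene code: the class LookupSketch, the toy DICT map, and both method names are hypothetical, and the real lookup() may differ in its data structures and return format. The sketch only assumes that matches are discovered while scanning start offsets in ascending order, in which case a sorted map keyed by Integer offsets adds nothing over appending to a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not taken from ja.dict.UserDictionary.
class LookupSketch {
    // Toy user dictionary (assumption): surface form -> lengths of its segments.
    static final Map<String, int[]> DICT = Map.of(
        "関西国際空港", new int[] {2, 2, 2},
        "朝青龍", new int[] {3});

    // Variant 1: collect matches in a TreeMap keyed by start offset,
    // then flatten the map into {offset, length} pairs.
    static int[][] lookupWithTreeMap(String text) {
        TreeMap<Integer, int[]> result = new TreeMap<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                int[] segments = DICT.get(text.substring(start, end));
                if (segments != null) {
                    result.put(start, segments); // a longer match overwrites a shorter one
                }
            }
        }
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, int[]> e : result.entrySet()) {
            appendSegments(out, e.getKey(), e.getValue());
        }
        return out.toArray(new int[0][]);
    }

    // Variant 2: start offsets are already visited in ascending order,
    // so the pairs can be appended directly, with no map and no boxed keys.
    static int[][] lookupWithList(String text) {
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int[] longest = null;
            for (int end = start + 1; end <= text.length(); end++) {
                int[] segments = DICT.get(text.substring(start, end));
                if (segments != null) {
                    longest = segments; // keep the longest match at this offset
                }
            }
            if (longest != null) {
                appendSegments(out, start, longest);
            }
        }
        return out.toArray(new int[0][]);
    }

    // Expands one match into consecutive {offset, length} pairs.
    private static void appendSegments(List<int[]> out, int start, int[] segments) {
        int pos = start;
        for (int len : segments) {
            out.add(new int[] {pos, len});
            pos += len;
        }
    }
}
```

Both variants return the same pairs for the same input, so under that ordering assumption the change would be purely internal. Whether it is worth doing depends on how often lookup() is called in practice, which is what I am asking about.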