Re: Issue with Japanese User Dictionary

2022-01-29 Thread Tomoko Uchida
Hi, let me briefly explain the background and intention of the change. Basically, character normalization is not the responsibility of a tokenizer and should not be performed when you "tokenize" text. Instead, there are CharFilters and TokenFilters that perform full-width and half-width normalization.
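For example, here is a minimal sketch of an analysis chain that folds half-width katakana to full-width before the tokenizer runs, assuming Lucene's ICU analysis module (and its ICU4J dependency) is on the classpath; the class name is illustrative:

    import java.io.Reader;
    import com.ibm.icu.text.Normalizer2;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer;

    public class WidthFoldingJapaneseAnalyzer extends Analyzer {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // NFKC folds half-width katakana (e.g. ﾚｺｰﾀﾞｰ) to full-width
        // (レコーダー) before any characters reach the tokenizer.
        return new ICUNormalizer2CharFilter(reader, Normalizer2.getNFKCInstance());
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // No user dictionary here; pass one as the first argument if needed.
        Tokenizer tokenizer =
            new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        return new TokenStreamComponents(tokenizer);
      }
    }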

Re: Issue with Japanese User Dictionary

2022-01-13 Thread Tomoko Uchida
Hi,
> The only thing that seems to differ is that the characters are full-width vs half-width, so I was wondering if this is intended behavior or a bug/too restrictive
This is intended behavior. The first column in the user dictionary must be equal to the concatenated string of the second column.
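Concretely, a user dictionary line has the form surface,segmentation,readings,part-of-speech, and the loader checks that the space-separated segments in the second column concatenate back to the surface form. A sketch (the ﾚｺｰﾀﾞｰ line is an illustrative guess at the failing shape, not the actual entry):

    # surface,segmentation,readings,part-of-speech
    # OK: the segments concatenate back to the surface form
    関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
    # Rejected: half-width surface vs full-width segmentation; the two
    # strings differ codepoint-for-codepoint even though they look alike
    ﾚｺｰﾀﾞｰ,レコーダー,レコーダー,カスタム名詞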

Re: Issue with Japanese User Dictionary

2022-01-13 Thread Marc D'Mello
Hi Mike, thanks for the response! I'm actually not super familiar with UserDictionaries, but looking at the code, it seems like a single line in the user-provided dictionary corresponds to a single entry? In that case, here is the line (or entry) that has both widths that I believe is ...

Re: Issue with Japanese User Dictionary

2022-01-13 Thread Michael Sokolov
Hi Marc, I wonder if there is a workaround for this issue: e.g., could we have entries for both widths? I also wonder if there is some interaction with an analysis chain that is doing half-width -> full-width conversion (or vice versa)? I think the UserDictionary has to operate on pre-analyzed tokens ...
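One possible reading of the both-widths workaround, sketched below: duplicate each entry in each width, keeping every line internally consistent so the concatenation check passes (the entries are illustrative):

    # Full-width entry
    レコーダー,レコーダー,レコーダー,カスタム名詞
    # Half-width twin: surface and segmentation are both half-width,
    # so the concatenated segmentation still equals the surface form
    ﾚｺｰﾀﾞｰ,ﾚｺｰﾀﾞｰ,ﾚｺｰﾀﾞｰ,カスタム名詞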

Issue with Japanese User Dictionary

2022-01-12 Thread Marc D'Mello
Hi, I had a question about the Japanese user dictionary. We have a user dictionary that used to work, but after attempting to upgrade Lucene, it fails with the following error: Caused by: java.lang.RuntimeException: Illegal user dictionary entry レコーダー - the concatenated segmentation (レコーダー) does ...
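For reference, a minimal way to exercise this check directly, assuming Lucene's kuromoji module is on the classpath; the entry below is a hypothetical mixed-width line, not the actual one from the failing dictionary:

    import java.io.StringReader;
    import org.apache.lucene.analysis.ja.dict.UserDictionary;

    public class UserDictCheck {
      public static void main(String[] args) throws Exception {
        // Half-width surface, full-width segmentation: the concatenated
        // segmentation no longer equals the surface form, so open() throws
        // "Illegal user dictionary entry ..." on Lucene versions that
        // perform this validation.
        String line = "ﾚｺｰﾀﾞｰ,レコーダー,レコーダー,カスタム名詞\n";
        UserDictionary dict = UserDictionary.open(new StringReader(line));
      }
    }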