Hi,

Let me give some brief background on the intention of the change. Character normalization is not the responsibility of a tokenizer and should not be performed when you "tokenize" text. Instead, there are charFilters and tokenFilters that perform full-width and half-width normalization.
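As a rough illustration of why the two steps are kept separate, here is a minimal sketch in plain Java. It is not Lucene's UserDictionary code: the class and method names are hypothetical, the half-width input is an assumed example, and java.text.Normalizer with NFKC stands in for width folding (NFKC covers half-width katakana but also does more than width folding). It assumes the second column of a user dictionary entry is a space-separated segmentation.

```java
import java.text.Normalizer;

// Illustrative sketch only; this is NOT Lucene's UserDictionary implementation.
class UserDictWidthSketch {

    // Hypothetical helper: true when the surface form (first column) equals the
    // space-separated segments (second column) joined together. The comparison
    // is exact; no normalization is applied here.
    static boolean matchesSurface(String surface, String segmentation) {
        String concatenated = String.join("", segmentation.split(" "));
        return surface.equals(concatenated);
    }

    // Hypothetical helper: NFKC folds half-width katakana into full-width kana.
    // Assumption: used here as a stand-in for a dedicated width filter.
    static String normalizeWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Half-width katakana: U+FF9A U+FF7A U+FF70 U+FF80 U+FF9E U+FF70
        String halfWidth = "\uFF9A\uFF7A\uFF70\uFF80\uFF9E\uFF70";
        // Full-width katakana: U+30EC U+30B3 U+30FC U+30C0 U+30FC
        String fullWidth = "\u30EC\u30B3\u30FC\u30C0\u30FC";

        // Exact comparison: the two widths are different code points.
        System.out.println(matchesSurface(halfWidth, fullWidth)); // false

        // Normalizing as a separate, explicit step makes the forms coincide.
        System.out.println(matchesSurface(normalizeWidth(halfWidth), fullWidth)); // true
    }
}
```

The exact comparison is why an entry that mixes widths is rejected, and the second call shows that the two forms only coincide once normalization is applied as its own step.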
For width normalization, you can use either of the following:

- CJKWidthCharFilter (https://lucene.apache.org/core/9_0_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthCharFilter.html)
- CJKWidthFilter (https://lucene.apache.org/core/9_0_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html)

There are also more general filters that perform Unicode normalization:

- ICUNormalizer2CharFilter (https://lucene.apache.org/core/9_0_0/analysis/icu/org/apache/lucene/analysis/icu/ICUNormalizer2CharFilter.html)
- ICUNormalizer2Filter (https://lucene.apache.org/core/9_0_0/analysis/icu/org/apache/lucene/analysis/icu/ICUNormalizer2Filter.html)

If you need a general suggestion, I'd recommend the charFilters here: in most use cases, character normalization should be done before applying dictionary-based tokenizers such as JapaneseTokenizer. JapaneseAnalyzer has included CJKWidthCharFilter since Lucene 9.0, so you don't need to worry about full-width and half-width normalization if you use it.

Tomoko

On Fri, Jan 14, 2022 at 11:58, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
>
> Hi,
>
> > The only thing that seems to differ is that the characters are full-width
> > vs half-width, so I was wondering if this is intended behavior or a bug/too
> > restrictive
>
> This is intended behavior. The first column in the user dictionary
> must be equal to the concatenated string of the second column in terms
> of Unicode code points. No normalization, such as full-width and
> half-width normalization, should be applied (any normalization or
> tweak can cause runtime bugs).
>
> On Fri, Jan 14, 2022 at 5:45, Marc D'Mello <marcd2...@gmail.com> wrote:
> >
> > Hi Mike,
> >
> > Thanks for the response! I'm actually not super familiar with
> > UserDictionaries, but looking at the code, it seems like a single line in
> > the user-provided user dictionary corresponds to a single entry?
> > In that case, here is the line (or entry) that does have both widths
> > and that I believe is causing the problem:
> >
> > レコーダー,レコーダー,レコーダー,JA名詞
> >
> > I'm guessing here the surface is レコーダー and the concatenated segment is the
> > first occurrence of レコーダー. I'm not sure what surface or concatenated
> > segment means, though, or what it would mean semantically to replace the
> > surface with the full-width version or the concatenated segment with the
> > half-width version.
> >
> > Thanks,
> > Marc
> >
> >
> > On Thu, Jan 13, 2022 at 7:18 AM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > > Hi Marc, I wonder if there is a workaround for this issue: e.g., could
> > > we have entries for both widths? I wonder if there is some interaction
> > > with an analysis chain that is doing half-width -> full-width
> > > conversion (or vice versa)? I think the UserDictionary has to operate
> > > on pre-analyzed tokens ... although maybe *after* char filtering,
> > > which presumably could handle width conversions. A bunch of rambling,
> > > but maybe the point is: can you share some more information? What
> > > is the full entry in the dictionary that causes the problem?
> > >
> > > On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello <marcd2...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I had a question about the Japanese user dictionary.
> > > > We have a user dictionary that used to work, but after attempting to
> > > > upgrade Lucene, it fails with the following error:
> > > >
> > > > Caused by: java.lang.RuntimeException: Illegal user dictionary entry レコーダー
> > > > - the concatenated segmentation (レコーダー) does not match the surface form
> > > > (レコーダー)
> > > > at org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
> > > >
> > > > The specific commit causing this error is here
> > > > <https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9>.
> > > > The only thing that seems to differ is that the characters are full-width
> > > > vs half-width, so I was wondering if this is intended behavior or a bug/too
> > > > restrictive. Any suggestions for fixing this would be greatly appreciated!
> > > >
> > > > Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org