Thanks! I actually did read the Stanford posts when we implemented our index; they were very helpful!
-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Tuesday, December 19, 2017 1:31 AM
To: solr-user@lucene.apache.org
Subject: Re: ICUTransformFilter with traditional to simplified Chinese

On 12/18/2017 9:49 AM, Eyal Naamati wrote:
> We are using the ICUTransformFilter to normalize traditional Chinese text to
> simplified Chinese.
> We received feedback from some of our Chinese customers that there are some
> traditional characters that are not converted to their simplified variants.
> For example:
> "眞" should be converted to "真"
> "硏" should be converted to "研"
> "夲" should be converted to "本"
>
> Does anyone know if this is indeed a problem with the filter?
> Or if there are other options to use instead of this filter that handle more
> characters?

I have one index for a website we built for a customer in Japan. While researching how to effectively handle CJK characters, I came across an entire series of blog posts. Here's the first post; you can check the same blog for more posts on the same subject. There are a lot of them:

http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

One of the filters that Stanford utilized (and we also implemented) is a custom filter that they wrote, apparently specifically because there are things that the ICU filters included with Lucene do not catch.
https://github.com/sul-dlss/CJKFoldingFilter

Looking into the code for the custom filter and checking into your first example, this filter actually seems to go in the reverse direction -- it converts 真 to 眞. I did not look into the other examples, and I'm completely clueless about CJK characters, so I don't know what those characters are or what the correct action would be.

That third-party custom filter would probably be helpful to you. Even though it goes in the reverse direction for your first example, as long as the behavior at index time and query time is the same, you should still get matches. End users would most likely never see the results of the analysis.

Whether or not the behavior you've noticed is a bug with ICUTransformFilter is a question that I cannot answer. If it is, then the bug will be in ICU, not Lucene.

http://lucene.apache.org/core/7_1_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUTransformFilter.html

Thanks,
Shawn
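
For anyone reading this thread later, here is a rough Python sketch of the point about index time and query time. It is not the Stanford filter or anything from Lucene or ICU. The FOLD table and the analyze, build_index, and search helpers are hypothetical names made up for illustration. Only the 真 to 眞 direction comes from the message above; the directions of the other two pairs are assumptions chosen to match it. The idea is that the fold direction does not matter as long as the same fold runs on both sides.

```python
# Hypothetical folding table: each pair of variants collapses to one form.
# Only the direction of the first pair (真 -> 眞) is taken from the thread;
# the other two directions are assumed purely for illustration.
FOLD = str.maketrans({"真": "眞", "研": "硏", "本": "夲"})


def analyze(text: str) -> str:
    """Apply the same character folding at index time and at query time."""
    return text.translate(FOLD)


def build_index(docs: dict) -> dict:
    """Build a toy inverted index: folded character -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for ch in analyze(text):
            index.setdefault(ch, set()).add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents containing every folded query character."""
    result = None
    for ch in analyze(query):
        ids = index.get(ch, set())
        result = set(ids) if result is None else result & ids
    return result if result is not None else set()


docs = {1: "眞理", 2: "真理", 3: "研究"}
index = build_index(docs)

# Either variant of the query character finds both documents, because the
# stored terms and the query terms went through the same fold.
print(search(index, "真"))
print(search(index, "眞"))
```

The stored terms here are the "reverse" forms, but as noted above that only matters if end users see the analyzed output, which they normally do not.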