RE: ICUTransformFilter with traditional to simplified Chinese

Eyal Naamati Tue, 19 Dec 2017 06:07:13 -0800

Thanks!
I actually did read the Stanford posts when we implemented our index; they were 
very helpful!

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Tuesday, December 19, 2017 1:31 AM
To: solr-user@lucene.apache.org
Subject: Re: ICUTransformFilter with traditional to simplified Chinese

On 12/18/2017 9:49 AM, Eyal Naamati wrote:
> We are using the ICUTransformFilter to normalize traditional Chinese text to 
> simplified Chinese.
> We received feedback from some of our Chinese customers that there are some 
> traditional characters that are not converted to their simplified variants.
> For example:
> "眞" should be converted to "真"
> "硏" should be converted to "研"
> "夲" should be converted to "本"
>
> Does anyone know if this is indeed a problem with the filter?
> Or if there are other options to use instead of this filter that handle more 
> characters?

I have one index for a website we built for a customer in Japan.  While 
researching how to effectively handle CJK characters, I came across an entire 
series of blog posts.  Here's the first post; the same blog has many more on 
the same subject:

http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

One of the filters that Stanford utilized (and we also implemented) is a custom 
filter that they wrote, apparently specifically because there are things that 
the ICU filters included with Lucene do not catch.

https://github.com/sul-dlss/CJKFoldingFilter

Looking into the code for the custom filter and checking into your first 
example, this filter actually seems to go in the reverse direction -- it 
converts 真 to 眞.  I did not look into the other examples, and I'm completely 
clueless about CJK characters, so I don't know what those characters are or 
what the correct action would be.

That third-party custom filter would probably be helpful to you.  Even though 
it goes in the reverse direction for your first example, as long as the 
behavior at index time and query time is the same, you should still get 
matches.  End users would most likely never see the results of the analysis.
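
To illustrate why the direction of the folding does not matter, here is a 
minimal Python sketch.  It is not Solr or Lucene code, and the FOLD table and 
the fold, index_docs, and search helpers are hypothetical names made up for 
this illustration.  The single mapping is taken from your first example, in 
the direction the custom filter seems to use.

```python
# Hypothetical one-entry folding table: maps 真 to 眞 (the "reverse" direction).
FOLD = str.maketrans({"真": "眞"})


def fold(text):
    """Apply the same folding that is used at both index time and query time."""
    return text.translate(FOLD)


def index_docs(docs):
    """Build a toy index: folded text -> list of original documents."""
    index = {}
    for doc in docs:
        index.setdefault(fold(doc), []).append(doc)
    return index


def search(index, query):
    """Fold the query the same way as the documents, then look it up."""
    return index.get(fold(query), [])


index = index_docs(["真", "眞"])
print(search(index, "真"))  # finds both variants
print(search(index, "眞"))  # finds both variants
```

Both queries return both documents, because each variant ends up as the same 
indexed term.  Only the folded form is stored in the toy index keys; the 
original text is what gets returned.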

Whether or not the behavior you've noticed is a bug with ICUTransformFilter is 
a question that I cannot answer.  If it is, then the bug will be in ICU, not 
Lucene.

http://lucene.apache.org/core/7_1_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUTransformFilter.html

Thanks,
Shawn
