Re: Question regarding searching Chinese characters

Walter Underwood Fri, 20 Jul 2018 08:05:30 -0700

I expect that this is the line that does the transformation:

   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>


This mapping is a standard feature of ICU. More info on ICU transforms is in 
this doc, though not much detail on this particular transform. 

http://userguide.icu-project.org/transforms/general
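
For illustration, a stripped-down field type that relies only on this transform might look like the sketch below. The field type name text_zh_fold is made up, and this assumes the ICU analysis jars (Solr's analysis-extras contrib) are on the classpath. Since the same analyzer runs at index and query time, 舊小說, 旧小说, and mixed forms such as 旧小說 should all be folded toward the simplified form before matching, but that is worth confirming in the Analysis tab.

```xml
<!-- Sketch only: "text_zh_fold" is a hypothetical name, not from the GitHub config -->
<fieldType name="text_zh_fold" class="solr.TextField" positionIncrementGap="10000">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Fold Traditional characters to Simplified in both index and query -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <!-- Index overlapping Han bigrams, plus unigrams for single-character queries -->
    <filter class="solr.CJKBigramFilterFactory" han="true" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```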

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> 
> I think so.  I used the exact configuration shown on GitHub:
> 
> <fieldType name="text_cjk" class="solr.TextField"
>            positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>  <analyzer>
>    <tokenizer class="solr.ICUTokenizerFactory" />
>    <filter class="solr.CJKWidthFilterFactory"/>
>    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>    <filter class="solr.ICUFoldingFilterFactory"/>
>    <filter class="solr.CJKBigramFilterFactory" han="true"
>            hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>  </analyzer>
> </fieldType>
> 
> 
> 
> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com>
> wrote:
> 
>> Thanks! That does indeed look promising... This can be added on top of
>> Smart Chinese, right? Or is it an alternative?
>> 
>> 
>> ------
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> <http://www.maoistlegacy.uni-freiburg.de/>
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>> 
>> 
>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com>
>> wrote:
>> 
>>> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
>>> then each of A, B, C and D in the query, and they all seem to match.
>>> CJKFF is transforming 舊 to 旧.
>>> 
>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com>
>>> wrote:
>>> 
>>>> I lack knowledge of the Chinese language, but if you want, I can do a
>>>> quick test for you in the Analysis tab if you tell me what to put in the
>>>> index and query windows...
>>>> 
>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Have you tried using CJKFoldingFilter
>>>>> (https://github.com/sul-dlss/CJKFoldingFilter)?  I am not sure if this
>>>>> would cover your use case, but I am using this filter and so far have
>>>>> had no issues.
>>>>> 
>>>>> Thnx
>>>>> 
>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Thanks, Alex - I have seen a few of those links but never considered
>>>>>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
>>>>>> basically what is laid out in the old blogspot post, namely this point:
>>>>>> 
>>>>>> 
>>>>>> "Why approach CJK resource discovery differently?
>>>>>> 
>>>>>> 2.  Search results must be as script agnostic as possible.
>>>>>> 
>>>>>> There is more than one way to write each word. "Simplified" characters
>>>>>> were emphasized for printed materials in mainland China starting in the
>>>>>> 1950s; "Traditional" characters were used in printed materials prior to
>>>>>> the 1950s, and are still used in Taiwan, Hong Kong and Macau today.
>>>>>> Since the characters are distinct, it's as if Chinese materials are
>>>>>> written in two scripts.
>>>>>> Another way to think about it:  every written Chinese word has at least
>>>>>> two completely different spellings.  And it can be mix-n-match:  a word
>>>>>> can be written with one traditional and one simplified character.
>>>>>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>>>>>> results should include matches for 舊小說 (traditional) and 旧小说
>>>>>> (simplified characters for old fiction)"
>>>>>> 
>>>>>> So, using the example provided above, we are dealing with materials
>>>>>> produced in the 1950s-1970s that do even weirder things like:
>>>>>> 
>>>>>> A. 舊小說
>>>>>> 
>>>>>> can also be
>>>>>> 
>>>>>> B. 旧小说 (all simplified)
>>>>>> or
>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>> or
>>>>>> D. 舊小说 (first character traditional, last character simplified)
>>>>>> 
>>>>>> Thankfully the middle character was never simplified in recent times.
>>>>>> 
>>>>>> From a historical standpoint, the mixed nature of the characters in the
>>>>>> same word/phrase is because not all simplified characters were adopted
>>>>>> at the same time by everyone uniformly (good times...).
>>>>>> 
>>>>>> The problem seems to be that Solr can easily handle A or B above, but
>>>>>> NOT C or D using the Smart Chinese analyzer. I'm not really sure how to
>>>>>> change that at this point... maybe I should figure out how to contact
>>>>>> the creators of the analyzer and ask them?
>>>>>> 
>>>>>> Amanda
>>>>>> 
>>>>>> ------
>>>>>> Dr. Amanda Shuman
>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>> PhD, University of California, Santa Cruz
>>>>>> http://www.amandashuman.net/
>>>>>> http://www.prchistoryresources.org/
>>>>>> Office: +49 (0) 761 203 4925
>>>>>> 
>>>>>> 
>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>>>>>> arafa...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> This is probably your start, if not read already:
>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>> 
>>>>>>> Otherwise, I think your answer would be somewhere around using ICU4J,
>>>>>>> IBM's library for dealing with Unicode: http://site.icu-project.org/
>>>>>>> (mentioned on the same page above)
>>>>>>> Specifically, transformations:
>>>>>>> http://userguide.icu-project.org/transforms/general
>>>>>>> 
>>>>>>> With that, maybe you map both alphabets into Latin. I did that once
>>>>>>> for Thai for a demo:
>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>>>>>> 
>>>>>>> The challenge is to figure out all the magic rules for that. You'd
>>>>>>> have to dig through the ICU documentation and other web pages. I found
>>>>>>> this one for example:
>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
>>>>>>> 
>>>>>>> There is also a 12-part series on Solr and Asian text processing,
>>>>>>> though it is a bit old now: http://discovery-grindstone.blogspot.com/
>>>>>>> 
>>>>>>> Hope one of these things helps.
>>>>>>> 
>>>>>>> Regards,
>>>>>>>   Alex.
>>>>>>> 
>>>>>>> 
>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> We have a problem. Some of our historical documents have simplified
>>>>>>>> and traditional Chinese characters mixed together. There seems to be
>>>>>>>> no problem when searching either traditional or simplified
>>>>>>>> separately - that is, if a particular string/phrase is all in
>>>>>>>> traditional or all in simplified, it finds it - but it does not find
>>>>>>>> the string/phrase if the two different characters (one traditional,
>>>>>>>> one simplified) are mixed together in the SAME string/phrase.
>>>>>>>> 
>>>>>>>> Has anyone ever handled this problem before? I know some libraries
>>>>>>>> seem to have implemented something that seems to be able to handle
>>>>>>>> this, but I'm not sure how they did so!
>>>>>>>> 
>>>>>>>> Amanda
>>>>>>>> ------
>>>>>>>> Dr. Amanda Shuman
>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>> http://www.amandashuman.net/
>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>> Office: +49 (0) 761 203 4925
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>
