Exactly. More concretely, the starting point is: replace your analyzer

<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>

with

<analyzer>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory"
id="Traditional-Simplified"/>
</analyzer>

and see if the results are as expected. Then research other filters if
your requirements are not met.

Just a reminder: HMMChineseTokenizerFactory does not handle traditional
characters, as I noted in a previous post, so ICUTransformFilterFactory is an
incomplete workaround.
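
For completeness, the analyzer above would sit inside a fieldType definition. This is only a sketch of what I mean; the fieldType name "text_zh" is my assumption and the config is untested:

```xml
<!-- Sketch only: the name "text_zh" is an assumption, not tested config.
     Note the caveat above: the transform runs AFTER tokenization, so
     traditional-character input may already be segmented incorrectly
     by the tokenizer before the filter ever sees it. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```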

On Jul 21, 2018 (Sat) at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:

> I expect that this is the line that does the transformation:
>
>    <filter class="solr.ICUTransformFilterFactory"
> id="Traditional-Simplified"/>
>
> This mapping is a standard feature of ICU. More info on ICU transforms is
> in this doc, though not much detail on this particular transform.
>
> http://userguide.icu-project.org/transforms/general
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
> >
> > I think so.  I used the exact config as in GitHub
> >
> > <fieldType name="text_cjk" class="solr.TextField"
> > positionIncrementGap="10000" autoGeneratePhraseQueries="false">
> >  <analyzer>
> >    <tokenizer class="solr.ICUTokenizerFactory" />
> >    <filter class="solr.CJKWidthFilterFactory"/>
> >    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> >    <filter class="solr.ICUTransformFilterFactory"
> id="Traditional-Simplified"/>
> >    <filter class="solr.ICUTransformFilterFactory"
> id="Katakana-Hiragana"/>
> >    <filter class="solr.ICUFoldingFilterFactory"/>
> >    <filter class="solr.CJKBigramFilterFactory" han="true"
> > hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
> >  </analyzer>
> > </fieldType>
> >
> >
> >
> > On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com
> >
> > wrote:
> >
> >> Thanks! That does indeed look promising... This can be added on top of
> >> Smart Chinese, right? Or is it an alternative?
> >>
> >>
> >> ------
> >> Dr. Amanda Shuman
> >> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >> <http://www.maoistlegacy.uni-freiburg.de/>
> >> PhD, University of California, Santa Cruz
> >> http://www.amandashuman.net/
> >> http://www.prchistoryresources.org/
> >> Office: +49 (0) 761 203 4925
> >>
> >>
> >> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com>
> >> wrote:
> >>
> >>> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index
> >>> and then each of A, B, C, or D in the query, and they seem to match;
> >>> CJKFF is transforming the 舊 to 旧
> >>>
> >>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com>
> >>> wrote:
> >>>
> >>>> I lack Chinese language knowledge, but if you want, I can do a quick
> >>>> test for you in the Analysis tab if you give me what to put in the
> >>>> index and query windows...
> >>>>
> >>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com
> >
> >>>> wrote:
> >>>>
> >>>>> Have you tried CJKFoldingFilter
> >>>>> (https://github.com/sul-dlss/CJKFoldingFilter)?  I am not sure if this
> >>>>> would cover your use case, but I am using this filter and so far no
> >>>>> issues.
> >>>>>
> >>>>> Thnx
> >>>>>
> >>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> >> amanda.shu...@gmail.com
> >>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Thanks, Alex - I have seen a few of those links but never considered
> >>>>>> transliteration! We use lucene's Smart Chinese analyzer. The issue
> is
> >>>>>> basically what is laid out in the old blogspot post, namely this
> >> point:
> >>>>>>
> >>>>>>
> >>>>>> "Why approach CJK resource discovery differently?
> >>>>>>
> >>>>>> 2.  Search results must be as script agnostic as possible.
> >>>>>>
> >>>>>> There is more than one way to write each word. "Simplified"
> >> characters
> >>>>>> were
> >>>>>> emphasized for printed materials in mainland China starting in the
> >>> 1950s;
> >>>>>> "Traditional" characters were used in printed materials prior to the
> >>>>>> 1950s,
> >>>>>> and are still used in Taiwan, Hong Kong and Macau today.
> >>>>>> Since the characters are distinct, it's as if Chinese materials are
> >>>>>> written
> >>>>>> in two scripts.
> >>>>>> Another way to think about it:  every written Chinese word has at
> >> least
> >>>>>> two
> >>>>>> completely different spellings.  And it can be mix-n-match:  a word
> >> can
> >>>>>> be
> >>>>>> written with one traditional  and one simplified character.
> >>>>>> Example:   Given a user query 舊小說  (traditional for old fiction),
> the
> >>>>>> results should include matches for 舊小說 (traditional) and 旧小说
> >>> (simplified
> >>>>>> characters for old fiction)"
> >>>>>>
> >>>>>> So, using the example provided above, we are dealing with materials
> >>>>>> produced in the 1950s-1970s that do even weirder things like:
> >>>>>>
> >>>>>> A. 舊小說
> >>>>>>
> >>>>>> can also be
> >>>>>>
> >>>>>> B. 旧小说 (all simplified)
> >>>>>> or
> >>>>>> C. 旧小說 (first character simplified, last character traditional)
> >>>>>> or
> >>>>>> D. 舊小 说 (first character traditional, last character simplified)
> >>>>>>
> >>>>>> Thankfully the middle character was never simplified in recent
> times.
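>
> [Editor's note: the folding these four variants need can be sketched in a
> few lines outside Solr. The mapping below is illustrative only and covers
> just this example; a real system would use a full table such as the ICU
> Traditional-Simplified transform.]

```python
# Illustrative sketch (not Solr): fold every character to its simplified
# form before comparison, so all four variants above match each other.
# The mapping here is a hand-written stand-in for a full transform table.
TRAD_TO_SIMP = {"舊": "旧", "說": "说"}  # 小 was never simplified

def fold(text):
    """Map each traditional character to its simplified form."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

# A, B, C, D from the example: all fold to the same simplified string.
variants = ["舊小說", "旧小说", "旧小說", "舊小说"]
assert len({fold(v) for v in variants}) == 1
```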
> >>>>>>
> >>>>>> From a historical standpoint, the mixed nature of the characters in
> >> the
> >>>>>> same word/phrase is because not all simplified characters were
> >> adopted
> >>> at
> >>>>>> the same time by everyone uniformly (good times...).
> >>>>>>
> >>>>>> The problem seems to be that Solr can easily handle A or B above,
> but
> >>>>>> NOT C
> >>>>>> or D using the Smart Chinese analyzer. I'm not really sure how to
> >>> change
> >>>>>> that at this point... maybe I should figure out how to contact the
> >>>>>> creators
> >>>>>> of the analyzer and ask them?
> >>>>>>
> >>>>>> Amanda
> >>>>>>
> >>>>>> ------
> >>>>>> Dr. Amanda Shuman
> >>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> >> Project
> >>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>>>>> PhD, University of California, Santa Cruz
> >>>>>> http://www.amandashuman.net/
> >>>>>> http://www.prchistoryresources.org/
> >>>>>> Office: +49 (0) 761 203 4925
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> >>>>>> arafa...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> This is probably your start, if not read already:
> >>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >>>>>>>
> >>>>>>> Otherwise, I think your answer would be somewhere around using
> >> ICU4J,
> >>>>>>> IBM's library for dealing with Unicode:
> >> http://site.icu-project.org/
> >>>>>>> (mentioned on the same page above)
> >>>>>>> Specifically, transformations:
> >>>>>>> http://userguide.icu-project.org/transforms/general
> >>>>>>>
> >>>>>>> With that, maybe you map both scripts into Latin. I did that once
> >>>>>>> for Thai for a demo:
> >>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> >>>>>>>
> >>>>>>> The challenge is to figure out all the magic rules for that. You'd
> >>>>>>> have to dig through the ICU documentation and other web pages. I
> >>> found
> >>>>>>> this one for example:
> >>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
> >>>>>>>
> >>>>>>> There is also a 12-part series on Solr and Asian text processing,
> >>>>>>> though it is a bit old now: http://discovery-grindstone.blogspot.com/
> >>>>>>>
> >>>>>>> Hope one of these things help.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>   Alex.
> >>>>>>>
> >>>>>>>
> >>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com>
> >>>>>> wrote:
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> We have a problem. Some of our historical documents mix together
> >>>>>>>> simplified and traditional Chinese characters. There seems to be no
> >>>>>>>> problem when
> >>>>>>>> searching either traditional or simplified separately - that is,
> >>> if a
> >>>>>>>> particular string/phrase is all in traditional or simplified, it
> >>>>>> finds
> >>>>>>> it -
> >>>>>>>> but it does not find the string/phrase if the two different
> >>>>>> characters
> >>>>>>> (one
> >>>>>>>> traditional, one simplified) are mixed together in the SAME
> >>>>>>> string/phrase.
> >>>>>>>>
> >>>>>>>> Has anyone ever handled this problem before? I know some libraries
> >>>>>>>> seem to have implemented something that can handle this, but I'm
> >>>>>>>> not sure how they did so!
> >>>>>>>>
> >>>>>>>> Amanda
> >>>>>>>> ------
> >>>>>>>> Dr. Amanda Shuman
> >>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> >>>>>> Project
> >>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>>>>>>> PhD, University of California, Santa Cruz
> >>>>>>>> http://www.amandashuman.net/
> >>>>>>>> http://www.prchistoryresources.org/
> >>>>>>>> Office: +49 (0) 761 203 4925
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

-- 
Tomoko Uchida
