Yes, while Traditional-Simplified transformation is out of the scope of Unicode normalization, you may want to add ICUNormalizer2CharFilterFactory anyway :)
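For reference, the char filter also takes explicit arguments; the sketch below spells out what I believe are its defaults (please verify against the javadoc for your Solr version):

```xml
<!-- assumed defaults: name="nfkc_cf" (NFKC plus case folding), mode="compose" -->
<charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf" mode="compose"/>
```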
Let me refine my example settings:

  <analyzer>
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>

Regards,
Tomoko

On Sat, Jul 21, 2018 at 2:54, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Would ICUNormalizer2CharFilterFactory do? Or at least serve as a
> template of what needs to be done.
>
> Regards,
>    Alex.
>
> On 20 July 2018 at 12:40, Walter Underwood <wun...@wunderwood.org> wrote:
>
>> Looks like we need a charfilter version of the ICU transforms. That
>> could run before the tokenizer.
>>
>> I've never built a charfilter, but it seems like this would be a good
>> first project for someone who wants to contribute.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
>>>
>>> Exactly. More concretely, the starting point is replacing your analyzer
>>>
>>>   <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
>>>
>>> with
>>>
>>>   <analyzer>
>>>     <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>>>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>   </analyzer>
>>>
>>> and seeing if the results are as expected. Then research other filters
>>> if your requirements are not met.
>>>
>>> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
>>> characters, as I noted in a previous post, so ICUTransformFilterFactory
>>> is an incomplete workaround.
>>>
>>> On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:
>>>
>>>> I expect that this is the line that does the transformation:
>>>>
>>>>   <filter class="solr.ICUTransformFilterFactory"
>>>>           id="Traditional-Simplified"/>
>>>>
>>>> This mapping is a standard feature of ICU. More info on ICU transforms
>>>> is in this doc, though not much detail on this particular transform.
>>>>
>>>> http://userguide.icu-project.org/transforms/general
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>>
>>>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>
>>>>> I think so. I used the exact settings as in github:
>>>>>
>>>>>   <fieldType name="text_cjk" class="solr.TextField"
>>>>>              positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>>>>     <analyzer>
>>>>>       <tokenizer class="solr.ICUTokenizerFactory" />
>>>>>       <filter class="solr.CJKWidthFilterFactory"/>
>>>>>       <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>>>       <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>>>       <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>>>>       <filter class="solr.ICUFoldingFilterFactory"/>
>>>>>       <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
>>>>>               katakana="true" hangul="true" outputUnigrams="true" />
>>>>>     </analyzer>
>>>>>   </fieldType>
>>>>>
>>>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>
>>>>>> Thanks! That does indeed look promising... This can be added on top of
>>>>>> Smart Chinese, right? Or is it an alternative?
>>>>>>
>>>>>> ------
>>>>>> Dr. Amanda Shuman
>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>> PhD, University of California, Santa Cruz
>>>>>> http://www.amandashuman.net/
>>>>>> http://www.prchistoryresources.org/
>>>>>> Office: +49 (0) 761 203 4925
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>
>>>>>>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index
>>>>>>> and then each of A, B, C, or D in the query, and they all seem to
>>>>>>> match; CJKFF transforms the 舊 to 旧.
>>>>>>>
>>>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I lack Chinese language knowledge, but if you want I can do a quick
>>>>>>>> test for you in the Analysis tab if you give me what to put in the
>>>>>>>> index and query windows...
>>>>>>>>
>>>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Have you tried using CJKFoldingFilter?
>>>>>>>>> https://github.com/sul-dlss/CJKFoldingFilter. I am not sure if this
>>>>>>>>> would cover your use case, but I am using this filter and so far no
>>>>>>>>> issues.
>>>>>>>>>
>>>>>>>>> Thnx
>>>>>>>>>
>>>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Alex - I have seen a few of those links but never considered
>>>>>>>>>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
>>>>>>>>>> basically what is laid out in the old blogspot post, namely this point:
>>>>>>>>>>
>>>>>>>>>> "Why approach CJK resource discovery differently?
>>>>>>>>>>
>>>>>>>>>> 2. Search results must be as script agnostic as possible.
>>>>>>>>>>
>>>>>>>>>> There is more than one way to write each word. "Simplified" characters
>>>>>>>>>> were emphasized for printed materials in mainland China starting in the
>>>>>>>>>> 1950s; "Traditional" characters were used in printed materials prior to
>>>>>>>>>> the 1950s, and are still used in Taiwan, Hong Kong and Macau today.
>>>>>>>>>> Since the characters are distinct, it's as if Chinese materials are
>>>>>>>>>> written in two scripts.
>>>>>>>>>> Another way to think about it: every written Chinese word has at least
>>>>>>>>>> two completely different spellings. And it can be mix-n-match: a word
>>>>>>>>>> can be written with one traditional and one simplified character.
>>>>>>>>>> Example: Given a user query 舊小說 (traditional for old fiction), the
>>>>>>>>>> results should include matches for 舊小說 (traditional) and 旧小说
>>>>>>>>>> (simplified characters for old fiction)"
>>>>>>>>>>
>>>>>>>>>> So, using the example provided above, we are dealing with materials
>>>>>>>>>> produced in the 1950s-1970s that do even weirder things like:
>>>>>>>>>>
>>>>>>>>>> A. 舊小說
>>>>>>>>>>
>>>>>>>>>> can also be
>>>>>>>>>>
>>>>>>>>>> B. 旧小说 (all simplified)
>>>>>>>>>> or
>>>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>>>>>> or
>>>>>>>>>> D. 舊小说 (first character traditional, last character simplified)
>>>>>>>>>>
>>>>>>>>>> Thankfully the middle character was never simplified in recent times.
>>>>>>>>>>
>>>>>>>>>> From a historical standpoint, the mixed nature of the characters in
>>>>>>>>>> the same word/phrase is because not all simplified characters were
>>>>>>>>>> adopted at the same time by everyone uniformly (good times...).
>>>>>>>>>>
>>>>>>>>>> The problem seems to be that Solr can easily handle A or B above, but
>>>>>>>>>> NOT C or D using the Smart Chinese analyzer. I'm not really sure how
>>>>>>>>>> to change that at this point... maybe I should figure out how to
>>>>>>>>>> contact the creators of the analyzer and ask them?
>>>>>>>>>>
>>>>>>>>>> Amanda
>>>>>>>>>>
>>>>>>>>>> ------
>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>> Office: +49 (0) 761 203 4925
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is probably your start, if not read already:
>>>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>>>>>>
>>>>>>>>>>> Otherwise, I think your answer would be somewhere around using ICU4J,
>>>>>>>>>>> IBM's library for dealing with Unicode: http://site.icu-project.org/
>>>>>>>>>>> (mentioned on the same page above)
>>>>>>>>>>> Specifically, transformations:
>>>>>>>>>>> http://userguide.icu-project.org/transforms/general
>>>>>>>>>>>
>>>>>>>>>>> With that, maybe you map both alphabets into Latin. I did that once
>>>>>>>>>>> for Thai for a demo:
>>>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>>>>>>>>>>
>>>>>>>>>>> The challenge is to figure out all the magic rules for that. You'd
>>>>>>>>>>> have to dig through the ICU documentation and other web pages. I
>>>>>>>>>>> found this one, for example:
>>>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
>>>>>>>>>>>
>>>>>>>>>>> There is also a 12-part series on Solr and Asian text processing,
>>>>>>>>>>> though it is a bit old now: http://discovery-grindstone.blogspot.com/
>>>>>>>>>>>
>>>>>>>>>>> Hope one of these things helps.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>    Alex.
>>>>>>>>>>>
>>>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> We have a problem. Some of our historical documents have mixed
>>>>>>>>>>>> together simplified and traditional Chinese characters. There seems
>>>>>>>>>>>> to be no problem when searching either traditional or simplified
>>>>>>>>>>>> separately - that is, if a particular string/phrase is all in
>>>>>>>>>>>> traditional or simplified, it finds it - but it does not find the
>>>>>>>>>>>> string/phrase if the two different characters (one traditional, one
>>>>>>>>>>>> simplified) are mixed together in the SAME string/phrase.
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone ever handled this problem before? I know some libraries
>>>>>>>>>>>> seem to have implemented something that seems to be able to handle
>>>>>>>>>>>> this, but I'm not sure how they did so!
>>>>>>>>>>>>
>>>>>>>>>>>> Amanda
>>>>>>>>>>>>
>>>>>>>>>>>> ------
>>>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>>>> Office: +49 (0) 761 203 4925
>>>
>>> --
>>> Tomoko Uchida

--
Tomoko Uchida
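To pull the thread's suggestions together: a complete fieldType wrapping the refined chain might look like the sketch below. The field type name `text_zh` and the positionIncrementGap value are illustrative, not from the thread, and as Tomoko notes, HMMChineseTokenizerFactory still segments traditional text imperfectly, so this remains a workaround rather than a full solution:

```xml
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode normalization before tokenization
         (does not convert traditional to simplified) -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <!-- HMM-based Chinese word segmentation -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- fold traditional characters to simplified after tokenization -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```

With this chain, the mixed forms 舊小說 / 旧小说 / 旧小說 / 舊小说 should all reduce to the same simplified tokens at index and query time, subject to the tokenizer caveat above.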