Hi Tomoko,

Thanks so much for this explanation - I did not even know this was possible! I will try it out, but I have one question: do I just need to change the settings from smartChinese to the ones you posted here:
<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Or do I still need to do something with the SmartChineseAnalyzer? I did not quite understand this in your first message:

"I think you need two steps if you want to use HMMChineseTokenizer correctly.
1. transform all traditional characters to simplified ones and save them to temporary files. I do not have a clear idea of how to do this, but you could create a Java program that calls Lucene's ICUTransformFilter.
2. then, index to Solr using SmartChineseAnalyzer."

My understanding is that with the new settings you posted, I don't need to do these steps. Is that correct? Otherwise, I don't really know how to do step 1 with the Java program....

Thanks!
Amanda

------
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
<http://www.maoistlegacy.uni-freiburg.de/>
PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:

> Yes, while traditional-simplified transformation is out of the scope of
> Unicode normalization, you would want to add ICUNormalizer2CharFilterFactory
> anyway :)
>
> Let me refine my example settings:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> Regards,
> Tomoko
>
>
> On Sat, Jul 21, 2018 at 2:54, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
> > Would ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > template of what needs to be done.
> >
> > Regards,
> >    Alex.
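Tomoko's earlier "step 1" (pre-converting traditional characters to simplified ones with a standalone Java program) can be sketched with ICU4J's Transliterator, which is the engine behind ICUTransformFilterFactory. This is a minimal illustration, not code from the thread, and it assumes the icu4j jar is on the classpath:

```java
// Requires the icu4j jar (the library behind Solr's ICU filters) on the classpath.
import com.ibm.icu.text.Transliterator;

public class Trad2Simp {
    public static void main(String[] args) {
        // Same "Traditional-Simplified" transform id used in the Solr filter config
        Transliterator trad2simp = Transliterator.getInstance("Traditional-Simplified");
        System.out.println(trad2simp.transliterate("舊小說")); // prints 旧小说
    }
}
```

With the refined analyzer settings, though, such pre-processing should be unnecessary: the same transform runs inside the analysis chain at index and query time.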
> > On 20 July 2018 at 12:40, Walter Underwood <wun...@wunderwood.org> wrote:
> >
> > > Looks like we need a charfilter version of the ICU transforms. That
> > > could run before the tokenizer.
> > >
> > > I've never built a charfilter, but it seems like this would be a good
> > > first project for someone who wants to contribute.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/ (my blog)
> > >
> > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
> > >>
> > >> Exactly. More concretely, the starting point is replacing your analyzer
> > >>
> > >> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > >>
> > >> with
> > >>
> > >> <analyzer>
> > >>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> > >>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > >> </analyzer>
> > >>
> > >> and seeing if the results are as expected. Then research other filters if
> > >> your requirements are not met.
> > >>
> > >> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> > >> characters, as I noted in a previous post, so ICUTransformFilterFactory is an
> > >> incomplete workaround.
> > >>
> > >> On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:
> > >>
> > >>> I expect that this is the line that does the transformation:
> > >>>
> > >>> <filter class="solr.ICUTransformFilterFactory"
> > >>>     id="Traditional-Simplified"/>
> > >>>
> > >>> This mapping is a standard feature of ICU. More info on ICU transforms is
> > >>> in this doc, though not much detail on this particular transform.
> > >>>
> > >>> http://userguide.icu-project.org/transforms/general
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/ (my blog)
> > >>>
> > >>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >>>>
> > >>>> I think so. I used the exact settings as on GitHub:
> > >>>>
> > >>>> <fieldType name="text_cjk" class="solr.TextField"
> > >>>>     positionIncrementGap="10000" autoGeneratePhraseQueries="false">
> > >>>>   <analyzer>
> > >>>>     <tokenizer class="solr.ICUTokenizerFactory" />
> > >>>>     <filter class="solr.CJKWidthFilterFactory"/>
> > >>>>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> > >>>>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > >>>>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
> > >>>>     <filter class="solr.ICUFoldingFilterFactory"/>
> > >>>>     <filter class="solr.CJKBigramFilterFactory" han="true"
> > >>>>         hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
> > >>>>   </analyzer>
> > >>>> </fieldType>
> > >>>>
> > >>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > >>>>
> > >>>>> Thanks! That does indeed look promising... This can be added on top of
> > >>>>> Smart Chinese, right? Or is it an alternative?
> > >>>>>
> > >>>>> ------
> > >>>>> Dr. Amanda Shuman
> > >>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > >>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > >>>>> PhD, University of California, Santa Cruz
> > >>>>> http://www.amandashuman.net/
> > >>>>> http://www.prchistoryresources.org/
> > >>>>> Office: +49 (0) 761 203 4925
> > >>>>>
> > >>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >>>>>
> > >>>>>> I think CJKFoldingFilter will work for you.
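An aside on the CJKWidthFilterFactory in the fieldType above: the width folding it performs overlaps with Unicode NFKC normalization, which the JDK can demonstrate on its own. This is a rough illustration of the concept, not the filter's exact behavior:

```java
import java.text.Normalizer;

public class WidthFoldDemo {
    public static void main(String[] args) {
        // Full-width Latin letters and digits fold to their ASCII forms under NFKC
        String raw = "ＳＯＬＲ７";
        String folded = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        System.out.println(folded); // prints SOLR7
    }
}
```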
> > >>>>>> I put 舊小說 in the index and then
> > >>>>>> each of A, B, C, or D in the query, and they seem to be matching; CJKFF is
> > >>>>>> transforming the 舊 to 旧.
> > >>>>>>
> > >>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> I lack Chinese language knowledge, but if you want, I can do a quick test
> > >>>>>>> for you in the Analysis tab if you give me what to put in the index and
> > >>>>>>> query windows...
> > >>>>>>>
> > >>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> Have you tried CJKFoldingFilter
> > >>>>>>>> (https://github.com/sul-dlss/CJKFoldingFilter)? I am not sure if it
> > >>>>>>>> covers your use case, but I am using this filter and have had no issues so far.
> > >>>>>>>>
> > >>>>>>>> Thnx
> > >>>>>>>>
> > >>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks, Alex - I have seen a few of those links but never considered
> > >>>>>>>>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
> > >>>>>>>>> basically what is laid out in the old blogspot post, namely this point:
> > >>>>>>>>>
> > >>>>>>>>> "Why approach CJK resource discovery differently?
> > >>>>>>>>>
> > >>>>>>>>> 2. Search results must be as script agnostic as possible.
> > >>>>>>>>>
> > >>>>>>>>> There is more than one way to write each word. "Simplified" characters were
> > >>>>>>>>> emphasized for printed materials in mainland China starting in the 1950s;
> > >>>>>>>>> "Traditional" characters were used in printed materials prior to the 1950s,
> > >>>>>>>>> and are still used in Taiwan, Hong Kong and Macau today.
> > >>>>>>>>> Since the characters are distinct, it's as if Chinese materials are written
> > >>>>>>>>> in two scripts.
> > >>>>>>>>> Another way to think about it: every written Chinese word has at least two
> > >>>>>>>>> completely different spellings. And it can be mix-n-match: a word can be
> > >>>>>>>>> written with one traditional and one simplified character.
> > >>>>>>>>> Example: Given a user query 舊小說 (traditional for old fiction), the
> > >>>>>>>>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
> > >>>>>>>>> characters for old fiction)"
> > >>>>>>>>>
> > >>>>>>>>> So, using the example provided above, we are dealing with materials
> > >>>>>>>>> produced in the 1950s-1970s that do even weirder things like:
> > >>>>>>>>>
> > >>>>>>>>> A. 舊小說
> > >>>>>>>>>
> > >>>>>>>>> can also be
> > >>>>>>>>>
> > >>>>>>>>> B. 旧小说 (all simplified)
> > >>>>>>>>> or
> > >>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
> > >>>>>>>>> or
> > >>>>>>>>> D. 舊小说 (first character traditional, last character simplified)
> > >>>>>>>>>
> > >>>>>>>>> Thankfully the middle character was never simplified in recent times.
> > >>>>>>>>>
> > >>>>>>>>> From a historical standpoint, the mixed nature of the characters in the
> > >>>>>>>>> same word/phrase is because not all simplified characters were adopted at
> > >>>>>>>>> the same time by everyone uniformly (good times...).
> > >>>>>>>>>
> > >>>>>>>>> The problem seems to be that Solr can easily handle A or B above, but NOT C
> > >>>>>>>>> or D using the Smart Chinese analyzer. I'm not really sure how to change
> > >>>>>>>>> that at this point... maybe I should figure out how to contact the creators
> > >>>>>>>>> of the analyzer and ask them?
> > >>>>>>>>>
> > >>>>>>>>> Amanda
> > >>>>>>>>>
> > >>>>>>>>> ------
> > >>>>>>>>> Dr. Amanda Shuman
> > >>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > >>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > >>>>>>>>> PhD, University of California, Santa Cruz
> > >>>>>>>>> http://www.amandashuman.net/
> > >>>>>>>>> http://www.prchistoryresources.org/
> > >>>>>>>>> Office: +49 (0) 761 203 4925
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> This is probably your start, if not read already:
> > >>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> > >>>>>>>>>>
> > >>>>>>>>>> Otherwise, I think your answer would be somewhere around using ICU4J,
> > >>>>>>>>>> IBM's library for dealing with Unicode: http://site.icu-project.org/
> > >>>>>>>>>> (mentioned on the same page above)
> > >>>>>>>>>> Specifically, transformations:
> > >>>>>>>>>> http://userguide.icu-project.org/transforms/general
> > >>>>>>>>>>
> > >>>>>>>>>> With that, maybe you map both alphabets into Latin. I did that once
> > >>>>>>>>>> for Thai for a demo:
> > >>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> > >>>>>>>>>>
> > >>>>>>>>>> The challenge is to figure out all the magic rules for that. You'd
> > >>>>>>>>>> have to dig through the ICU documentation and other web pages. I found
> > >>>>>>>>>> this one for example:
> > >>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
> > >>>>>>>>>>
> > >>>>>>>>>> There is also a 12-part series on Solr and Asian text processing, though
> > >>>>>>>>>> it is a bit old now: http://discovery-grindstone.blogspot.com/
> > >>>>>>>>>>
> > >>>>>>>>>> Hope one of these things helps.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>>    Alex.
> > >>>>>>>>>>
> > >>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> We have a problem. Some of our historical documents mix together
> > >>>>>>>>>>> simplified and traditional Chinese characters. There seems to be no problem
> > >>>>>>>>>>> when searching either traditional or simplified separately - that is, if a
> > >>>>>>>>>>> particular string/phrase is all in traditional or simplified, it finds it -
> > >>>>>>>>>>> but it does not find the string/phrase if the two different characters (one
> > >>>>>>>>>>> traditional, one simplified) are mixed together in the SAME string/phrase.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Has anyone ever handled this problem before? I know some libraries seem to
> > >>>>>>>>>>> have implemented something that seems to be able to handle this, but I'm
> > >>>>>>>>>>> not sure how they did so!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Amanda
> > >>>>>>>>>>> ------
> > >>>>>>>>>>> Dr. Amanda Shuman
> > >>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > >>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > >>>>>>>>>>> PhD, University of California, Santa Cruz
> > >>>>>>>>>>> http://www.amandashuman.net/
> > >>>>>>>>>>> http://www.prchistoryresources.org/
> > >>>>>>>>>>> Office: +49 (0) 761 203 4925
> > >>
> > >> --
> > >> Tomoko Uchida
>
> --
> Tomoko Uchida