Hi,

There is ICUTransformFilter (included in the Solr distribution), which
should also work for you.
See the example settings:
https://lucene.apache.org/solr/guide/7_4/filter-descriptions.html#icu-transform-filter

Combine it with HMMChineseTokenizer.
https://lucene.apache.org/solr/guide/7_4/language-analysis.html#hmm-chinese-tokenizer

In other words, replace your SmartChineseAnalyzer settings with an
HMMChineseTokenizer & ICUTransformFilter pipeline, along the lines of the
sketch below.
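
A minimal field type sketch (the name "text_zh" is my placeholder, and you
may want extra filters such as lowercasing; note that both factories need
additional jars on Solr's classpath, as described on the reference guide
pages above):

<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- "Traditional-Simplified" is a built-in ICU transliterator ID -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>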

----
What follows is a somewhat complicated explanation, so you can skip it if
you do not want to go into analyzer details.

I do not understand Chinese, but it seems to me there are no easy or
one-stop solutions. (As Japanese speakers, we have similar problems with
Chinese characters.)

HMMChineseTokenizer expects Simplified Chinese text.
See:
https://lucene.apache.org/core/7_4_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizer.html

So you should transform all traditional Chinese characters **before**
applying HMMChineseTokenizer, using CharFilters; otherwise the Tokenizer
does not work correctly.

Unfortunately, there is no such CharFilter as far as I know.
ICUNormalizer2CharFilter does not handle this transformation, so it is no
help. CJKFoldingFilter and ICUTransformFilter do perform the
traditional-to-simplified transformation; however, they are TokenFilters,
which run only after the Tokenizer has been applied.

I think you need two steps if you want to use HMMChineseTokenizer correctly.

1. Transform all traditional characters to simplified ones and save the
results to temporary files.
    I do not have a clear idea of the best way to do this, but you could
write a small Java program that calls Lucene's ICUTransformFilter (see the
sketch after this list).
2. Then index into Solr using SmartChineseAnalyzer.
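
For step 1, here is a minimal sketch of such a program. As an assumption on
my side, instead of wiring up a full Lucene TokenStream it calls ICU4J's
Transliterator directly (the class that ICUTransformFilter wraps
internally); the class name and the input/output file arguments are just
placeholders:

import com.ibm.icu.text.Transliterator;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ToSimplified {
    public static void main(String[] args) throws IOException {
        // Built-in ICU transliterator: maps traditional Chinese characters
        // to simplified ones and leaves everything else untouched.
        Transliterator t = Transliterator.getInstance("Traditional-Simplified");
        String text = new String(
            Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        Files.write(
            Paths.get(args[1]),
            t.transliterate(text).getBytes(StandardCharsets.UTF_8));
    }
}

Once the temporary files contain only simplified characters,
SmartChineseAnalyzer should segment them as expected in step 2.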

Regards,
Tomoko

On Fri, Jul 20, 2018 at 22:12, Susheel Kumar <susheel2...@gmail.com> wrote:

> I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and
> then each of A, B, C, or D in the query, and they seem to be matching;
> CJKFF is transforming the 舊 to 旧.
>
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
>
> > I lack Chinese language knowledge, but if you want, I can do a quick
> > test for you in the Analysis tab if you give me what to put in the
> > index and query windows...
> >
> > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com>
> > wrote:
> >
> >> Have you tried using CJKFoldingFilter
> >> (https://github.com/sul-dlss/CJKFoldingFilter)?  I am not sure if this
> >> would cover your use case, but I am using this filter and so far no
> >> issues.
> >>
> >> Thnx
> >>
> >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com>
> >> wrote:
> >>
> >>> Thanks, Alex - I have seen a few of those links but never considered
> >>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
> >>> basically what is laid out in the old blogspot post, namely this point:
> >>>
> >>>
> >>> "Why approach CJK resource discovery differently?
> >>>
> >>> 2.  Search results must be as script agnostic as possible.
> >>>
> >>> There is more than one way to write each word. "Simplified" characters
> >>> were
> >>> emphasized for printed materials in mainland China starting in the
> 1950s;
> >>> "Traditional" characters were used in printed materials prior to the
> >>> 1950s,
> >>> and are still used in Taiwan, Hong Kong and Macau today.
> >>> Since the characters are distinct, it's as if Chinese materials are
> >>> written
> >>> in two scripts.
> >>> Another way to think about it:  every written Chinese word has at least
> >>> two
> >>> completely different spellings.  And it can be mix-n-match:  a word can
> >>> be
> >>> written with one traditional  and one simplified character.
> >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> >>> results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
> >>> characters for old fiction)"
> >>>
> >>> So, using the example provided above, we are dealing with materials
> >>> produced in the 1950s-1970s that do even weirder things like:
> >>>
> >>> A. 舊小說
> >>>
> >>> can also be
> >>>
> >>> B. 旧小说 (all simplified)
> >>> or
> >>> C. 旧小說 (first character simplified, last character traditional)
> >>> or
> >>> D. 舊小说 (first character traditional, last character simplified)
> >>>
> >>> Thankfully the middle character was never simplified in recent times.
> >>>
> >>> From a historical standpoint, the mixed nature of the characters in
> >>> the same word/phrase is because not all simplified characters were
> >>> adopted at the same time by everyone uniformly (good times...).
> >>>
> >>> The problem seems to be that Solr can easily handle A or B above, but
> >>> NOT C or D using the Smart Chinese analyzer. I'm not really sure how
> >>> to change that at this point... maybe I should figure out how to
> >>> contact the creators of the analyzer and ask them?
> >>>
> >>> Amanda
> >>>
> >>> ------
> >>> Dr. Amanda Shuman
> >>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>> PhD, University of California, Santa Cruz
> >>> http://www.amandashuman.net/
> >>> http://www.prchistoryresources.org/
> >>> Office: +49 (0) 761 203 4925
> >>>
> >>>
> >>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch
> >>> <arafa...@gmail.com> wrote:
> >>>
> >>> > This is probably your start, if not read already:
> >>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >>> >
> >>> > Otherwise, I think your answer would be somewhere around using ICU4J,
> >>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> >>> > (mentioned on the same page above)
> >>> > Specifically, transformations:
> >>> > http://userguide.icu-project.org/transforms/general
> >>> >
> >>> > With that, maybe you map both alphabets into Latin. I did that once
> >>> > for Thai for a demo:
> >>> > https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> >>> >
> >>> > The challenge is to figure out all the magic rules for that. You'd
> >>> > have to dig through the ICU documentation and other web pages. I
> >>> > found this one for example:
> >>> > http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
> >>> >
> >>> > There is also a 12-part series on Solr and Asian text processing,
> >>> > though it is a bit old now: http://discovery-grindstone.blogspot.com/
> >>> >
> >>> > Hope one of these things help.
> >>> >
> >>> > Regards,
> >>> >    Alex.
> >>> >
> >>> >
> >>> > On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com>
> >>> > wrote:
> >>> > > Hi all,
> >>> > >
> >>> > > We have a problem. Some of our historical documents have mixed
> >>> > > together simplified and traditional Chinese characters. There
> >>> > > seems to be no problem when searching either traditional or
> >>> > > simplified separately - that is, if a particular string/phrase is
> >>> > > all in traditional or simplified, it finds it - but it does not
> >>> > > find the string/phrase if the two different characters (one
> >>> > > traditional, one simplified) are mixed together in the SAME
> >>> > > string/phrase.
> >>> > >
> >>> > > Has anyone ever handled this problem before? I know some libraries
> >>> > > seem to have implemented something that seems to be able to handle
> >>> > > this, but I'm not sure how they did so!
> >>> > >
> >>> > > Amanda
> >>> > > ------
> >>> > > Dr. Amanda Shuman
> >>> > > Post-doc researcher, University of Freiburg, The Maoist Legacy
> >>> > > Project
> >>> > > <http://www.maoistlegacy.uni-freiburg.de/>
> >>> > > PhD, University of California, Santa Cruz
> >>> > > http://www.amandashuman.net/
> >>> > > http://www.prchistoryresources.org/
> >>> > > Office: +49 (0) 761 203 4925
> >>> >
> >>>
> >>
> >>
> >
>


-- 
Tomoko Uchida
