Re: Question regarding searching Chinese characters

Susheel Kumar Fri, 20 Jul 2018 06:12:01 -0700

I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and then
each of A, B, C, and D in the query, and they all seem to match. CJKFF is
transforming the 舊 to 旧.
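
To illustrate why folding helps with the mixed forms, here is a minimal
sketch in Python. It is not the filter's actual code: the two-entry table
and the names FOLD and fold are assumptions made up for this example, and a
real filter covers far more characters. The idea is only that if the same
character-level mapping is applied at index time and at query time, all four
variants end up as the same string.

```python
# Hypothetical, deliberately tiny folding table (traditional -> simplified).
# Only the two characters that differ in the example are listed.
FOLD = str.maketrans({"舊": "旧", "說": "说"})


def fold(text: str) -> str:
    """Replace each traditional character in the table with its simplified form."""
    return text.translate(FOLD)


# The four variants A-D from the thread.
variants = {
    "A": "舊小說",  # all traditional
    "B": "旧小说",  # all simplified
    "C": "旧小說",  # first simplified, last traditional
    "D": "舊小说",  # first traditional, last simplified
}

# Fold every variant; each one maps to the same normalized form.
folded = {label: fold(text) for label, text in variants.items()}

for label, text in folded.items():
    print(label, text)
```

Because the mapping works one character at a time, it does not matter
which characters in a phrase were written in which form.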


On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com>
wrote:

> I don't know Chinese, but if you want, I can do a quick test for you in
> the Analysis tab if you give me what to put in the index and query
> windows...
>
> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
>
>> Have you tried CJKFoldingFilter?
>> https://github.com/sul-dlss/CJKFoldingFilter
>> I am not sure if this would cover your use case, but I am using this
>> filter and so far have had no issues.
>>
>> Thanks
>>
>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com>
>> wrote:
>>
>>> Thanks, Alex - I have seen a few of those links but never considered
>>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>>> basically what is laid out in the old blogspot post, namely this point:
>>>
>>>
>>> "Why approach CJK resource discovery differently?
>>>
>>> 2.  Search results must be as script agnostic as possible.
>>>
>>> There is more than one way to write each word. "Simplified" characters
>>> were emphasized for printed materials in mainland China starting in the
>>> 1950s; "Traditional" characters were used in printed materials prior to
>>> the 1950s, and are still used in Taiwan, Hong Kong and Macau today.
>>> Since the characters are distinct, it's as if Chinese materials are
>>> written in two scripts.
>>> Another way to think about it:  every written Chinese word has at least
>>> two completely different spellings.  And it can be mix-n-match:  a word
>>> can be written with one traditional and one simplified character.
>>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
>>> characters for old fiction)"
>>>
>>> So, using the example provided above, we are dealing with materials
>>> produced in the 1950s-1970s that do even weirder things like:
>>>
>>> A. 舊小說
>>>
>>> can also be
>>>
>>> B. 旧小说 (all simplified)
>>> or
>>> C. 旧小說 (first character simplified, last character traditional)
>>> or
>>> D. 舊小说 (first character traditional, last character simplified)
>>>
>>> Thankfully the middle character was never simplified in recent times.
>>>
>>> From a historical standpoint, the mixed nature of the characters in the
>>> same word/phrase is because not all simplified characters were adopted at
>>> the same time by everyone uniformly (good times...).
>>>
>>> The problem seems to be that Solr can easily handle A or B above, but
>>> NOT C or D using the Smart Chinese analyzer. I'm not really sure how to
>>> change that at this point... maybe I should figure out how to contact
>>> the creators of the analyzer and ask them?
>>>
>>> Amanda
>>>
>>> ------
>>> Dr. Amanda Shuman
>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>> PhD, University of California, Santa Cruz
>>> http://www.amandashuman.net/
>>> http://www.prchistoryresources.org/
>>> Office: +49 (0) 761 203 4925
>>>
>>>
>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com>
>>> wrote:
>>>
>>> > This is probably your start, if not read already:
>>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>> >
>>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>>> > (mentioned on the same page above)
>>> > Specifically, transformations:
>>> > http://userguide.icu-project.org/transforms/general
>>> >
>>> > With that, maybe you map both alphabets into Latin. I did that once
>>> > for Thai for a demo:
>>> > https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>> >
>>> > The challenge is to figure out all the magic rules for that. You'd
>>> > have to dig through the ICU documentation and other web pages. I found
>>> > this one for example:
>>> > http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
>>> >
>>> > There is also a 12-part series on Solr and Asian text processing,
>>> > though it is a bit old now: http://discovery-grindstone.blogspot.com/
>>> >
>>> > Hope one of these things helps.
>>> >
>>> > Regards,
>>> >    Alex.
>>> >
>>> >
>>> > On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com>
>>> > wrote:
>>> > > Hi all,
>>> > >
>>> > > We have a problem. Some of our historical documents have simplified
>>> > > and traditional Chinese characters mixed together. There seems to be
>>> > > no problem when searching either traditional or simplified separately -
>>> > > that is, if a particular string/phrase is all in traditional or all in
>>> > > simplified, it finds it - but it does not find the string/phrase if the
>>> > > two different characters (one traditional, one simplified) are mixed
>>> > > together in the SAME string/phrase.
>>> > >
>>> > > Has anyone ever handled this problem before? I know some libraries seem
>>> > > to have implemented something that seems to be able to handle this, but
>>> > > I'm not sure how they did so!
>>> > >
>>> > > Amanda
>>> > > ------
>>> > > Dr. Amanda Shuman
>>> > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>> > > <http://www.maoistlegacy.uni-freiburg.de/>
>>> > > PhD, University of California, Santa Cruz
>>> > > http://www.amandashuman.net/
>>> > > http://www.prchistoryresources.org/
>>> > > Office: +49 (0) 761 203 4925
>>> >
>>>
>>
>>
>
