Exactly. More concretely, the starting point is: replace your analyzer

  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>

with

  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>

and see if the results are as expected. Then look into other filters if your
requirements are not met. Just a reminder: HMMChineseTokenizerFactory does not
handle traditional characters, as I noted in a previous post, so
ICUTransformFilterFactory is an incomplete workaround.

On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:

> I expect that this is the line that does the transformation:
>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>
> This mapping is a standard feature of ICU. More info on ICU transforms is
> in this doc, though not much detail on this particular transform.
>
> http://userguide.icu-project.org/transforms/general
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> >
> > I think so. I used the exact configuration as on GitHub:
> >
> > <fieldType name="text_cjk" class="solr.TextField"
> >     positionIncrementGap="10000" autoGeneratePhraseQueries="false">
> >   <analyzer>
> >     <tokenizer class="solr.ICUTokenizerFactory"/>
> >     <filter class="solr.CJKWidthFilterFactory"/>
> >     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> >     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> >     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
> >     <filter class="solr.ICUFoldingFilterFactory"/>
> >     <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
> >         katakana="true" hangul="true" outputUnigrams="true"/>
> >   </analyzer>
> > </fieldType>
> >
> > On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com>
> > wrote:
> >
> >> Thanks! That does indeed look promising... This can be added on top of
> >> Smart Chinese, right? Or is it an alternative?
> >>
> >> ------
> >> Dr. Amanda Shuman
> >> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >> <http://www.maoistlegacy.uni-freiburg.de/>
> >> PhD, University of California, Santa Cruz
> >> http://www.amandashuman.net/
> >> http://www.prchistoryresources.org/
> >> Office: +49 (0) 761 203 4925
> >>
> >>
> >> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com>
> >> wrote:
> >>
> >>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and
> >>> then each of A, B, C, or D in the query, and they seem to match; CJKFF
> >>> is transforming the 舊 to 旧.
> >>>
> >>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com>
> >>> wrote:
> >>>
> >>>> I lack Chinese language knowledge, but if you want, I can do a quick
> >>>> test for you in the Analysis tab if you give me what to put in the
> >>>> index and query windows...
> >>>>
> >>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Have you tried CJKFoldingFilter?
> >>>>> https://github.com/sul-dlss/CJKFoldingFilter. I am not sure if this
> >>>>> would cover your use case, but I am using this filter and so far have
> >>>>> had no issues.
> >>>>>
> >>>>> Thnx
> >>>>>
> >>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman
> >>>>> <amanda.shu...@gmail.com> wrote:
> >>>>>
> >>>>>> Thanks, Alex - I have seen a few of those links but never considered
> >>>>>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
> >>>>>> basically what is laid out in the old blogspot post, namely this
> >>>>>> point:
> >>>>>>
> >>>>>> "Why approach CJK resource discovery differently?
> >>>>>>
> >>>>>> 2. Search results must be as script agnostic as possible.
> >>>>>>
> >>>>>> There is more than one way to write each word.
"Simplified" > >> characters > >>>>>> were > >>>>>> emphasized for printed materials in mainland China starting in the > >>> 1950s; > >>>>>> "Traditional" characters were used in printed materials prior to the > >>>>>> 1950s, > >>>>>> and are still used in Taiwan, Hong Kong and Macau today. > >>>>>> Since the characters are distinct, it's as if Chinese materials are > >>>>>> written > >>>>>> in two scripts. > >>>>>> Another way to think about it: every written Chinese word has at > >> least > >>>>>> two > >>>>>> completely different spellings. And it can be mix-n-match: a word > >> can > >>>>>> be > >>>>>> written with one traditional and one simplified character. > >>>>>> Example: Given a user query 舊小說 (traditional for old fiction), > the > >>>>>> results should include matches for 舊小說 (traditional) and 旧小说 > >>> (simplified > >>>>>> characters for old fiction)" > >>>>>> > >>>>>> So, using the example provided above, we are dealing with materials > >>>>>> produced in the 1950s-1970s that do even weirder things like: > >>>>>> > >>>>>> A. 舊小說 > >>>>>> > >>>>>> can also be > >>>>>> > >>>>>> B. 旧小说 (all simplified) > >>>>>> or > >>>>>> C. 旧小說 (first character simplified, last character traditional) > >>>>>> or > >>>>>> D. 舊小 说 (first character traditional, last character simplified) > >>>>>> > >>>>>> Thankfully the middle character was never simplified in recent > times. > >>>>>> > >>>>>> From a historical standpoint, the mixed nature of the characters in > >> the > >>>>>> same word/phrase is because not all simplified characters were > >> adopted > >>> at > >>>>>> the same time by everyone uniformly (good times...). > >>>>>> > >>>>>> The problem seems to be that Solr can easily handle A or B above, > but > >>>>>> NOT C > >>>>>> or D using the Smart Chinese analyzer. I'm not really sure how to > >>> change > >>>>>> that at this point... maybe I should figure out how to contact the > >>>>>> creators > >>>>>> of the analyzer and ask them? 
> >>>>>>
> >>>>>> Amanda
> >>>>>>
> >>>>>> ------
> >>>>>> Dr. Amanda Shuman
> >>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>>>>> PhD, University of California, Santa Cruz
> >>>>>> http://www.amandashuman.net/
> >>>>>> http://www.prchistoryresources.org/
> >>>>>> Office: +49 (0) 761 203 4925
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch
> >>>>>> <arafa...@gmail.com> wrote:
> >>>>>>
> >>>>>>> This is probably your start, if not read already:
> >>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >>>>>>>
> >>>>>>> Otherwise, I think your answer would be somewhere around using ICU4J,
> >>>>>>> IBM's library for dealing with Unicode: http://site.icu-project.org/
> >>>>>>> (mentioned on the same page above)
> >>>>>>> Specifically, transformations:
> >>>>>>> http://userguide.icu-project.org/transforms/general
> >>>>>>>
> >>>>>>> With that, maybe you map both alphabets into Latin. I did that once
> >>>>>>> for Thai for a demo:
> >>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> >>>>>>>
> >>>>>>> The challenge is to figure out all the magic rules for that. You'd
> >>>>>>> have to dig through the ICU documentation and other web pages. I
> >>>>>>> found this one for example:
> >>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
> >>>>>>>
> >>>>>>> There is also a 12-part series on Solr and Asian text processing,
> >>>>>>> though it is a bit old now: http://discovery-grindstone.blogspot.com/
> >>>>>>>
> >>>>>>> Hope one of these things helps.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>    Alex.
> >>>>>>>
> >>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> We have a problem. Some of our historical documents mix together
> >>>>>>>> simplified and traditional Chinese characters. There seems to be no
> >>>>>>>> problem when searching either traditional or simplified separately
> >>>>>>>> - that is, if a particular string/phrase is all in traditional or
> >>>>>>>> all in simplified, it finds it - but it does not find the
> >>>>>>>> string/phrase if the two different characters (one traditional, one
> >>>>>>>> simplified) are mixed together in the SAME string/phrase.
> >>>>>>>>
> >>>>>>>> Has anyone ever handled this problem before? I know some libraries
> >>>>>>>> seem to have implemented something that seems to be able to handle
> >>>>>>>> this, but I'm not sure how they did so!
> >>>>>>>>
> >>>>>>>> Amanda
> >>>>>>>> ------
> >>>>>>>> Dr. Amanda Shuman
> >>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> >>>>>>>> Project
> >>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> >>>>>>>> PhD, University of California, Santa Cruz
> >>>>>>>> http://www.amandashuman.net/
> >>>>>>>> http://www.prchistoryresources.org/
> >>>>>>>> Office: +49 (0) 761 203 4925

--
Tomoko Uchida
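
P.S. For anyone wanting to try the suggestion at the top of this thread, the
analyzer replacement wrapped in a complete fieldType might look like the
sketch below. This is only an illustration: the fieldType name and the
positionIncrementGap value are my own assumptions, not something from the
thread, and the earlier caveat still applies - the transform runs as a token
filter after segmentation, and HMMChineseTokenizer itself does not know
traditional characters, so segmentation of traditional text may still differ
from the simplified equivalent.

```xml
<!-- Sketch only: "text_zh_folded" and positionIncrementGap are assumed names/values. -->
<fieldType name="text_zh_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- HMM-based word segmentation (the tokenizer behind SmartChineseAnalyzer) -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- Fold traditional forms to simplified so 舊小說, 旧小说, and mixed
         variants such as 旧小說 index and query as the same terms -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```

Since the same analyzer chain is applied at both index and query time here,
both a traditional query and a simplified query should reduce to the same
simplified tokens; the open question discussed above is only whether the
tokenizer segments traditional input correctly in the first place.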