I expect that this is the line that does the transformation: <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
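For illustration, a minimal sketch of a field type that uses only that transform might look like the fragment below. The field type name text_zh_folded is hypothetical, the ICU factories assume Solr's analysis-extras (ICU) jars are on the classpath, and because the tokenizer runs before the filter, mixed-script input is worth checking in the Analysis tab rather than assuming it matches.

```xml
<!-- Sketch only: "text_zh_folded" is a made-up name, not from the thread. -->
<fieldType name="text_zh_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> with no type attribute applies at both index and query time. -->
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Rewrites traditional characters to their simplified forms, e.g. 舊 to 旧. -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```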
This mapping is a standard feature of ICU. More info on ICU transforms is in
this doc, though there is not much detail on this particular transform.

http://userguide.icu-project.org/transforms/general

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>
> I think so. I used the exact configuration from GitHub:
>
> <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.ICUTokenizerFactory" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>   </analyzer>
> </fieldType>
>
> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>
>> Thanks! That does indeed look promising... This can be added on top of
>> Smart Chinese, right? Or is it an alternative?
>>
>> ------
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> <http://www.maoistlegacy.uni-freiburg.de/>
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>>
>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>
>>> I think CJKFoldingFilter will work for you.
>>> I put 舊小說 in the index and then each of A, B, C, or D in the query,
>>> and they all seem to match; CJKFF is transforming 舊 to 旧.
>>>
>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>
>>>> I lack Chinese language knowledge, but if you want, I can do a quick
>>>> test for you in the Analysis tab if you give me what to put in the
>>>> index and query windows...
>>>>
>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>
>>>>> Have you tried CJKFoldingFilter?
>>>>> https://github.com/sul-dlss/CJKFoldingFilter
>>>>> I am not sure if this would cover your use case, but I am using this
>>>>> filter and so far have had no issues.
>>>>>
>>>>> Thnx
>>>>>
>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>
>>>>>> Thanks, Alex - I have seen a few of those links but never considered
>>>>>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
>>>>>> basically what is laid out in the old blogspot post, namely this point:
>>>>>>
>>>>>> "Why approach CJK resource discovery differently?
>>>>>>
>>>>>> 2. Search results must be as script agnostic as possible.
>>>>>>
>>>>>> There is more than one way to write each word. "Simplified" characters
>>>>>> were emphasized for printed materials in mainland China starting in the
>>>>>> 1950s; "Traditional" characters were used in printed materials prior to
>>>>>> the 1950s, and are still used in Taiwan, Hong Kong and Macau today.
>>>>>> Since the characters are distinct, it's as if Chinese materials are
>>>>>> written in two scripts.
>>>>>> Another way to think about it: every written Chinese word has at least
>>>>>> two completely different spellings. And it can be mix-n-match: a word
>>>>>> can be written with one traditional and one simplified character.
>>>>>> Example: Given a user query 舊小說 (traditional for old fiction), the
>>>>>> results should include matches for 舊小說 (traditional) and 旧小说
>>>>>> (simplified characters for old fiction)"
>>>>>>
>>>>>> So, using the example provided above, we are dealing with materials
>>>>>> produced in the 1950s-1970s that do even weirder things like:
>>>>>>
>>>>>> A. 舊小說
>>>>>>
>>>>>> can also be
>>>>>>
>>>>>> B. 旧小说 (all simplified)
>>>>>> or
>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>> or
>>>>>> D. 舊小说 (first character traditional, last character simplified)
>>>>>>
>>>>>> Thankfully the middle character was never simplified in recent times.
>>>>>>
>>>>>> From a historical standpoint, the mixed nature of the characters in the
>>>>>> same word/phrase is because not all simplified characters were adopted
>>>>>> at the same time by everyone uniformly (good times...).
>>>>>>
>>>>>> The problem seems to be that Solr can easily handle A or B above, but
>>>>>> NOT C or D using the Smart Chinese analyzer. I'm not really sure how to
>>>>>> change that at this point... maybe I should figure out how to contact
>>>>>> the creators of the analyzer and ask them?
>>>>>>
>>>>>> Amanda
>>>>>>
>>>>>> ------
>>>>>> Dr.
Amanda Shuman
>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>> PhD, University of California, Santa Cruz
>>>>>> http://www.amandashuman.net/
>>>>>> http://www.prchistoryresources.org/
>>>>>> Office: +49 (0) 761 203 4925
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>>>
>>>>>>> This is probably your start, if not read already:
>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>>
>>>>>>> Otherwise, I think your answer would be somewhere around using ICU4J,
>>>>>>> IBM's library for dealing with Unicode: http://site.icu-project.org/
>>>>>>> (mentioned on the same page above)
>>>>>>> Specifically, transformations:
>>>>>>> http://userguide.icu-project.org/transforms/general
>>>>>>>
>>>>>>> With that, maybe you map both alphabets into Latin. I did that once
>>>>>>> for Thai for a demo:
>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>>>>>>
>>>>>>> The challenge is to figure out all the magic rules for that. You'd
>>>>>>> have to dig through the ICU documentation and other web pages. I found
>>>>>>> this one for example:
>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
>>>>>>>
>>>>>>> There is also a 12-part series on Solr and Asian text processing,
>>>>>>> though it is a bit old now: http://discovery-grindstone.blogspot.com/
>>>>>>>
>>>>>>> Hope one of these things helps.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Alex.
>>>>>>>
>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We have a problem. Some of our historical documents have mixed together
>>>>>>>> simplified and traditional Chinese characters.
>>>>>>>> There seems to be no problem when searching either traditional or
>>>>>>>> simplified separately - that is, if a particular string/phrase is all
>>>>>>>> in traditional or simplified, it finds it - but it does not find the
>>>>>>>> string/phrase if the two different characters (one traditional, one
>>>>>>>> simplified) are mixed together in the SAME string/phrase.
>>>>>>>>
>>>>>>>> Has anyone ever handled this problem before? I know some libraries
>>>>>>>> seem to have implemented something that seems to be able to handle
>>>>>>>> this, but I'm not sure how they did so!
>>>>>>>>
>>>>>>>> Amanda
>>>>>>>> ------
>>>>>>>> Dr. Amanda Shuman
>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>> http://www.amandashuman.net/
>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>> Office: +49 (0) 761 203 4925