Yes, while Traditional-Simplified transformation is out of the scope of Unicode normalization, you may want to add ICUNormalizer2CharFilterFactory anyway :)
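For reference, the char filter also takes explicit arguments; the sketch below spells out what I believe are its defaults (please verify against the javadoc for your Solr version):

```xml
<!-- assumed defaults: name="nfkc_cf" (NFKC plus case folding), mode="compose" -->
<charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf" mode="compose"/>
```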
Let me refine my example settings:

  <analyzer>
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>

Regards,
Tomoko

On Sat, Jul 21, 2018 at 2:54, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Would ICUNormalizer2CharFilterFactory do? Or at least serve as a
> template of what needs to be done.
>
> Regards,
>    Alex.
>
> On 20 July 2018 at 12:40, Walter Underwood <wun...@wunderwood.org> wrote:
>
>> Looks like we need a charfilter version of the ICU transforms. That
>> could run before the tokenizer.
>>
>> I've never built a charfilter, but it seems like this would be a good
>> first project for someone who wants to contribute.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
>>>
>>> Exactly. More concretely, the starting point is replacing your analyzer
>>>
>>>   <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
>>>
>>> with
>>>
>>>   <analyzer>
>>>     <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>>>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>   </analyzer>
>>>
>>> and seeing if the results are as expected. Then research other filters
>>> if your requirements are not met.
>>>
>>> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
>>> characters, as I noted in a previous post, so ICUTransformFilterFactory
>>> is an incomplete workaround.
>>>
>>> On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:
>>>
>>>> I expect that this is the line that does the transformation:
>>>>
>>>>   <filter class="solr.ICUTransformFilterFactory"
>>>>           id="Traditional-Simplified"/>
>>>>
>>>> This mapping is a standard feature of ICU. More info on ICU transforms
>>>> is in this doc, though not much detail on this particular transform.
>>>>
>>>> http://userguide.icu-project.org/transforms/general
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>>
>>>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>
>>>>> I think so. I used the exact settings as in github:
>>>>>
>>>>>   <fieldType name="text_cjk" class="solr.TextField"
>>>>>              positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>>>>     <analyzer>
>>>>>       <tokenizer class="solr.ICUTokenizerFactory" />
>>>>>       <filter class="solr.CJKWidthFilterFactory"/>
>>>>>       <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>>>       <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>>>       <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>>>>       <filter class="solr.ICUFoldingFilterFactory"/>
>>>>>       <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
>>>>>               katakana="true" hangul="true" outputUnigrams="true" />
>>>>>     </analyzer>
>>>>>   </fieldType>
>>>>>
>>>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>
>>>>>> Thanks! That does indeed look promising... This can be added on top of
>>>>>> Smart Chinese, right? Or is it an alternative?
>>>>>>
>>>>>> ------
>>>>>> Dr. Amanda Shuman
>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>> PhD, University of California, Santa Cruz
>>>>>> http://www.amandashuman.net/
>>>>>> http://www.prchistoryresources.org/
>>>>>> Office: +49 (0) 761 203 4925
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>
>>>>>>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index
>>>>>>> and then each of A, B, C, or D in the query, and they all seem to
>>>>>>> match; CJKFF transforms the 舊 to 旧.
>>>>>>>
>>>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I lack Chinese language knowledge, but if you want I can do a quick
>>>>>>>> test for you in the Analysis tab if you give me what to put in the
>>>>>>>> index and query windows...
>>>>>>>>
>>>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Have you tried using CJKFoldingFilter?
>>>>>>>>> https://github.com/sul-dlss/CJKFoldingFilter. I am not sure if this
>>>>>>>>> would cover your use case, but I am using this filter and so far no
>>>>>>>>> issues.
>>>>>>>>>
>>>>>>>>> Thnx
>>>>>>>>>
>>>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Alex - I have seen a few of those links but never considered
>>>>>>>>>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
>>>>>>>>>> basically what is laid out in the old blogspot post, namely this point:
>>>>>>>>>>
>>>>>>>>>> "Why approach CJK resource discovery differently?
>>>>>>>>>>
>>>>>>>>>> 2. Search results must be as script agnostic as possible.
>>>>>>>>>>
>>>>>>>>>> There is more than one way to write each word. "Simplified" characters
>>>>>>>>>> were emphasized for printed materials in mainland China starting in the
>>>>>>>>>> 1950s; "Traditional" characters were used in printed materials prior to
>>>>>>>>>> the 1950s, and are still used in Taiwan, Hong Kong and Macau today.
>>>>>>>>>> Since the characters are distinct, it's as if Chinese materials are
>>>>>>>>>> written in two scripts.
>>>>>>>>>> Another way to think about it: every written Chinese word has at least
>>>>>>>>>> two completely different spellings. And it can be mix-n-match: a word
>>>>>>>>>> can be written with one traditional and one simplified character.
>>>>>>>>>> Example: Given a user query 舊小說 (traditional for old fiction), the
>>>>>>>>>> results should include matches for 舊小說 (traditional) and 旧小说
>>>>>>>>>> (simplified characters for old fiction)"
>>>>>>>>>>
>>>>>>>>>> So, using the example provided above, we are dealing with materials
>>>>>>>>>> produced in the 1950s-1970s that do even weirder things like:
>>>>>>>>>>
>>>>>>>>>> A. 舊小說
>>>>>>>>>>
>>>>>>>>>> can also be
>>>>>>>>>>
>>>>>>>>>> B. 旧小说 (all simplified)
>>>>>>>>>> or
>>>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>>>>>> or
>>>>>>>>>> D. 舊小说 (first character traditional, last character simplified)
>>>>>>>>>>
>>>>>>>>>> Thankfully the middle character was never simplified in recent times.
>>>>>>>>>>
>>>>>>>>>> From a historical standpoint, the mixed nature of the characters in
>>>>>>>>>> the same word/phrase is because not all simplified characters were
>>>>>>>>>> adopted at the same time by everyone uniformly (good times...).
>>>>>>>>>>
>>>>>>>>>> The problem seems to be that Solr can easily handle A or B above, but
>>>>>>>>>> NOT C or D using the Smart Chinese analyzer. I'm not really sure how
>>>>>>>>>> to change that at this point... maybe I should figure out how to
>>>>>>>>>> contact the creators of the analyzer and ask them?
>>>>>>>>>>
>>>>>>>>>> Amanda
>>>>>>>>>>
>>>>>>>>>> ------
>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>> Office: +49 (0) 761 203 4925
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is probably your start, if not read already:
>>>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>>>>>>
>>>>>>>>>>> Otherwise, I think your answer would be somewhere around using ICU4J,
>>>>>>>>>>> IBM's library for dealing with Unicode: http://site.icu-project.org/
>>>>>>>>>>> (mentioned on the same page above)
>>>>>>>>>>> Specifically, transformations:
>>>>>>>>>>> http://userguide.icu-project.org/transforms/general
>>>>>>>>>>>
>>>>>>>>>>> With that, maybe you map both alphabets into Latin. I did that once
>>>>>>>>>>> for Thai for a demo:
>>>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>>>>>>>>>>
>>>>>>>>>>> The challenge is to figure out all the magic rules for that. You'd
>>>>>>>>>>> have to dig through the ICU documentation and other web pages. I
>>>>>>>>>>> found this one, for example:
>>>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
>>>>>>>>>>>
>>>>>>>>>>> There is also a 12-part series on Solr and Asian text processing,
>>>>>>>>>>> though it is a bit old now: http://discovery-grindstone.blogspot.com/
>>>>>>>>>>>
>>>>>>>>>>> Hope one of these things helps.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>    Alex.
>>>>>>>>>>>
>>>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> We have a problem. Some of our historical documents have mixed
>>>>>>>>>>>> together simplified and traditional Chinese characters. There seems
>>>>>>>>>>>> to be no problem when searching either traditional or simplified
>>>>>>>>>>>> separately - that is, if a particular string/phrase is all in
>>>>>>>>>>>> traditional or simplified, it finds it - but it does not find the
>>>>>>>>>>>> string/phrase if the two different characters (one traditional, one
>>>>>>>>>>>> simplified) are mixed together in the SAME string/phrase.
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone ever handled this problem before? I know some libraries
>>>>>>>>>>>> seem to have implemented something that seems to be able to handle
>>>>>>>>>>>> this, but I'm not sure how they did so!
>>>>>>>>>>>>
>>>>>>>>>>>> Amanda
>>>>>>>>>>>>
>>>>>>>>>>>> ------
>>>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>>>> Office: +49 (0) 761 203 4925
>>>
>>> --
>>> Tomoko Uchida

--
Tomoko Uchida
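To pull the thread's suggestions together: a complete fieldType wrapping the refined chain might look like the sketch below. The field type name `text_zh` and the positionIncrementGap value are illustrative, not from the thread, and as Tomoko notes, HMMChineseTokenizerFactory still segments traditional text imperfectly, so this remains a workaround rather than a full solution:

```xml
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode normalization before tokenization
         (does not convert traditional to simplified) -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <!-- HMM-based Chinese word segmentation -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- fold traditional characters to simplified after tokenization -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```

With this chain, the mixed forms 舊小說 / 旧小说 / 旧小說 / 舊小说 should all reduce to the same simplified tokens at index and query time, subject to the tokenizer caveat above.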