Hi Tomoko,

Thanks so much for this explanation - I did not even know this was possible! I will try it out, but I have one question: do I just need to change the settings from smartChinese to the ones you posted here:
<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Or do I still need to do something with the SmartChineseAnalyzer? I did not quite understand this in your first message:

"I think you need two steps if you want to use HMMChineseTokenizer correctly.
1. transform all traditional characters to simplified ones and save them to temporary files. I do not have a clear idea of how to do this, but you could create a Java program that calls Lucene's ICUTransformFilter.
2. then, index to Solr using SmartChineseAnalyzer."

My understanding is that with the new settings you posted, I don't need to do these steps. Is that correct? Otherwise, I don't really know how to do step 1 with the Java program....

Thanks!
Amanda

------
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
<http://www.maoistlegacy.uni-freiburg.de/>
PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:

> Yes, while traditional-simplified transformation is out of the scope of
> Unicode normalization, you would want to add ICUNormalizer2CharFilterFactory
> anyway :)
>
> Let me refine my example settings:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> Regards,
> Tomoko
>
>
> On Sat, Jul 21, 2018 at 2:54, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
> > Would ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > template of what needs to be done.
> >
> > Regards,
> >    Alex.
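Tomoko's earlier "step 1" (pre-converting traditional characters to simplified ones with a standalone Java program) can be sketched with ICU4J's Transliterator, which is the engine behind ICUTransformFilterFactory. This is a minimal illustration, not code from the thread, and it assumes the icu4j jar is on the classpath:

```java
// Requires the icu4j jar (the library behind Solr's ICU filters) on the classpath.
import com.ibm.icu.text.Transliterator;

public class Trad2Simp {
    public static void main(String[] args) {
        // Same "Traditional-Simplified" transform id used in the Solr filter config
        Transliterator trad2simp = Transliterator.getInstance("Traditional-Simplified");
        System.out.println(trad2simp.transliterate("舊小說")); // prints 旧小说
    }
}
```

With the refined analyzer settings, though, such pre-processing should be unnecessary: the same transform runs inside the analysis chain at index and query time.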
> > On 20 July 2018 at 12:40, Walter Underwood <wun...@wunderwood.org> wrote:
> >
> > > Looks like we need a charfilter version of the ICU transforms. That
> > > could run before the tokenizer.
> > >
> > > I've never built a charfilter, but it seems like this would be a good
> > > first project for someone who wants to contribute.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/ (my blog)
> > >
> > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
> > >>
> > >> Exactly. More concretely, the starting point is replacing your analyzer
> > >>
> > >> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > >>
> > >> with
> > >>
> > >> <analyzer>
> > >>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> > >>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > >> </analyzer>
> > >>
> > >> and seeing if the results are as expected. Then research other filters if
> > >> your requirements are not met.
> > >>
> > >> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> > >> characters, as I noted in a previous post, so ICUTransformFilterFactory is an
> > >> incomplete workaround.
> > >>
> > >> On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:
> > >>
> > >>> I expect that this is the line that does the transformation:
> > >>>
> > >>> <filter class="solr.ICUTransformFilterFactory"
> > >>>     id="Traditional-Simplified"/>
> > >>>
> > >>> This mapping is a standard feature of ICU. More info on ICU transforms is
> > >>> in this doc, though not much detail on this particular transform.
> > >>>
> > >>> http://userguide.icu-project.org/transforms/general
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/ (my blog)
> > >>>
> > >>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >>>>
> > >>>> I think so. I used the exact settings as on GitHub:
> > >>>>
> > >>>> <fieldType name="text_cjk" class="solr.TextField"
> > >>>>     positionIncrementGap="10000" autoGeneratePhraseQueries="false">
> > >>>>   <analyzer>
> > >>>>     <tokenizer class="solr.ICUTokenizerFactory" />
> > >>>>     <filter class="solr.CJKWidthFilterFactory"/>
> > >>>>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> > >>>>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > >>>>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
> > >>>>     <filter class="solr.ICUFoldingFilterFactory"/>
> > >>>>     <filter class="solr.CJKBigramFilterFactory" han="true"
> > >>>>         hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
> > >>>>   </analyzer>
> > >>>> </fieldType>
> > >>>>
> > >>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > >>>>
> > >>>>> Thanks! That does indeed look promising... This can be added on top of
> > >>>>> Smart Chinese, right? Or is it an alternative?
> > >>>>>
> > >>>>> ------
> > >>>>> Dr. Amanda Shuman
> > >>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > >>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > >>>>> PhD, University of California, Santa Cruz
> > >>>>> http://www.amandashuman.net/
> > >>>>> http://www.prchistoryresources.org/
> > >>>>> Office: +49 (0) 761 203 4925
> > >>>>>
> > >>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >>>>>
> > >>>>>> I think CJKFoldingFilter will work for you.
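An aside on the CJKWidthFilterFactory in the fieldType above: the width folding it performs overlaps with Unicode NFKC normalization, which the JDK can demonstrate on its own. This is a rough illustration of the concept, not the filter's exact behavior:

```java
import java.text.Normalizer;

public class WidthFoldDemo {
    public static void main(String[] args) {
        // Full-width Latin letters and digits fold to their ASCII forms under NFKC
        String raw = "ＳＯＬＲ７";
        String folded = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        System.out.println(folded); // prints SOLR7
    }
}
```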
> > >>>>>> I put 舊小說 in the index and then
> > >>>>>> each of A, B, C, or D in the query, and they seem to be matching; CJKFF is
> > >>>>>> transforming the 舊 to 旧.
> > >>>>>>
> > >>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> I lack Chinese language knowledge, but if you want, I can do a quick test
> > >>>>>>> for you in the Analysis tab if you give me what to put in the index and
> > >>>>>>> query windows...
> > >>>>>>>
> > >>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> Have you tried CJKFoldingFilter
> > >>>>>>>> (https://github.com/sul-dlss/CJKFoldingFilter)? I am not sure if it
> > >>>>>>>> covers your use case, but I am using this filter and have had no issues so far.
> > >>>>>>>>
> > >>>>>>>> Thnx
> > >>>>>>>>
> > >>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks, Alex - I have seen a few of those links but never considered
> > >>>>>>>>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
> > >>>>>>>>> basically what is laid out in the old blogspot post, namely this point:
> > >>>>>>>>>
> > >>>>>>>>> "Why approach CJK resource discovery differently?
> > >>>>>>>>>
> > >>>>>>>>> 2. Search results must be as script agnostic as possible.
> > >>>>>>>>>
> > >>>>>>>>> There is more than one way to write each word. "Simplified" characters were
> > >>>>>>>>> emphasized for printed materials in mainland China starting in the 1950s;
> > >>>>>>>>> "Traditional" characters were used in printed materials prior to the 1950s,
> > >>>>>>>>> and are still used in Taiwan, Hong Kong and Macau today.
> > >>>>>>>>> Since the characters are distinct, it's as if Chinese materials are written
> > >>>>>>>>> in two scripts.
> > >>>>>>>>> Another way to think about it: every written Chinese word has at least two
> > >>>>>>>>> completely different spellings. And it can be mix-n-match: a word can be
> > >>>>>>>>> written with one traditional and one simplified character.
> > >>>>>>>>> Example: Given a user query 舊小說 (traditional for old fiction), the
> > >>>>>>>>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
> > >>>>>>>>> characters for old fiction)"
> > >>>>>>>>>
> > >>>>>>>>> So, using the example provided above, we are dealing with materials
> > >>>>>>>>> produced in the 1950s-1970s that do even weirder things like:
> > >>>>>>>>>
> > >>>>>>>>> A. 舊小說
> > >>>>>>>>>
> > >>>>>>>>> can also be
> > >>>>>>>>>
> > >>>>>>>>> B. 旧小说 (all simplified)
> > >>>>>>>>> or
> > >>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
> > >>>>>>>>> or
> > >>>>>>>>> D. 舊小说 (first character traditional, last character simplified)
> > >>>>>>>>>
> > >>>>>>>>> Thankfully the middle character was never simplified in recent times.
> > >>>>>>>>>
> > >>>>>>>>> From a historical standpoint, the mixed nature of the characters in the
> > >>>>>>>>> same word/phrase is because not all simplified characters were adopted at
> > >>>>>>>>> the same time by everyone uniformly (good times...).
> > >>>>>>>>>
> > >>>>>>>>> The problem seems to be that Solr can easily handle A or B above, but NOT C
> > >>>>>>>>> or D using the Smart Chinese analyzer. I'm not really sure how to change
> > >>>>>>>>> that at this point... maybe I should figure out how to contact the creators
> > >>>>>>>>> of the analyzer and ask them?
> > >>>>>>>>>
> > >>>>>>>>> Amanda
> > >>>>>>>>>
> > >>>>>>>>> ------
> > >>>>>>>>> Dr. Amanda Shuman
> > >>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > >>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > >>>>>>>>> PhD, University of California, Santa Cruz
> > >>>>>>>>> http://www.amandashuman.net/
> > >>>>>>>>> http://www.prchistoryresources.org/
> > >>>>>>>>> Office: +49 (0) 761 203 4925
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> This is probably your start, if not read already:
> > >>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> > >>>>>>>>>>
> > >>>>>>>>>> Otherwise, I think your answer would be somewhere around using ICU4J,
> > >>>>>>>>>> IBM's library for dealing with Unicode: http://site.icu-project.org/
> > >>>>>>>>>> (mentioned on the same page above)
> > >>>>>>>>>> Specifically, transformations:
> > >>>>>>>>>> http://userguide.icu-project.org/transforms/general
> > >>>>>>>>>>
> > >>>>>>>>>> With that, maybe you map both alphabets into Latin. I did that once
> > >>>>>>>>>> for Thai for a demo:
> > >>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> > >>>>>>>>>>
> > >>>>>>>>>> The challenge is to figure out all the magic rules for that. You'd
> > >>>>>>>>>> have to dig through the ICU documentation and other web pages. I found
> > >>>>>>>>>> this one for example:
> > >>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
> > >>>>>>>>>>
> > >>>>>>>>>> There is also a 12-part series on Solr and Asian text processing, though
> > >>>>>>>>>> it is a bit old now: http://discovery-grindstone.blogspot.com/
> > >>>>>>>>>>
> > >>>>>>>>>> Hope one of these things helps.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>>    Alex.
> > >>>>>>>>>>
> > >>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> We have a problem. Some of our historical documents mix together
> > >>>>>>>>>>> simplified and traditional Chinese characters. There seems to be no problem
> > >>>>>>>>>>> when searching either traditional or simplified separately - that is, if a
> > >>>>>>>>>>> particular string/phrase is all in traditional or simplified, it finds it -
> > >>>>>>>>>>> but it does not find the string/phrase if the two different characters (one
> > >>>>>>>>>>> traditional, one simplified) are mixed together in the SAME string/phrase.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Has anyone ever handled this problem before? I know some libraries seem to
> > >>>>>>>>>>> have implemented something that seems to be able to handle this, but I'm
> > >>>>>>>>>>> not sure how they did so!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Amanda
> > >>>>>>>>>>> ------
> > >>>>>>>>>>> Dr. Amanda Shuman
> > >>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > >>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > >>>>>>>>>>> PhD, University of California, Santa Cruz
> > >>>>>>>>>>> http://www.amandashuman.net/
> > >>>>>>>>>>> http://www.prchistoryresources.org/
> > >>>>>>>>>>> Office: +49 (0) 761 203 4925
> > >>
> > >> --
> > >> Tomoko Uchida
>
> --
> Tomoko Uchida