I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then tried each of A, B, C, and D as the query. They all seem to match, and CJKFoldingFilter is transforming 舊 to 旧.
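
In case it helps to see why folding makes the mixed forms match, here is a minimal Python sketch of the idea. It is not the filter's actual code: the two-entry table and the `fold` helper are hypothetical and cover only the characters from your example, whereas a real filter would ship a far larger mapping and run on both the index and query side.

```python
# Hypothetical folding table: only the two characters from the thread's
# example. A real folding filter would carry a much larger mapping.
TRAD_TO_SIMP = {
    "舊": "旧",
    "說": "说",
}


def fold(text: str) -> str:
    """Replace each known traditional character with its simplified form."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


# The four spellings A-D from the thread.
variants = {
    "A": "舊小說",  # all traditional
    "B": "旧小说",  # all simplified
    "C": "旧小說",  # simplified first, traditional last
    "D": "舊小说",  # traditional first, simplified last
}

# Fold the indexed text once, then fold each query the same way and compare.
indexed = fold("舊小說")
matches = {label: fold(query) == indexed for label, query in variants.items()}

print(indexed)
print(matches)
```

Because both sides are reduced to the same folded string, it no longer matters which mix of characters the document or the query used.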

On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:

> I don't know Chinese, but if you want, I can do a quick test for you in
> the Analysis tab if you give me what to put in the index and query
> windows...
>
> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
>
>> Have you tried CJKFoldingFilter
>> (https://github.com/sul-dlss/CJKFoldingFilter)? I am not sure whether it
>> would cover your use case, but I am using this filter and have had no
>> issues so far.
>>
>> Thanks
>>
>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com>
>> wrote:
>>
>>> Thanks, Alex - I have seen a few of those links but never considered
>>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
>>> basically what is laid out in the old blogspot post, namely this point:
>>>
>>> "Why approach CJK resource discovery differently?
>>>
>>> 2. Search results must be as script agnostic as possible.
>>>
>>> There is more than one way to write each word. "Simplified" characters
>>> were emphasized for printed materials in mainland China starting in the
>>> 1950s; "Traditional" characters were used in printed materials prior to
>>> the 1950s, and are still used in Taiwan, Hong Kong and Macau today.
>>> Since the characters are distinct, it's as if Chinese materials are
>>> written in two scripts.
>>> Another way to think about it: every written Chinese word has at least
>>> two completely different spellings. And it can be mix-n-match: a word
>>> can be written with one traditional and one simplified character.
>>> Example: Given a user query 舊小說 (traditional for old fiction), the
>>> results should include matches for 舊小說 (traditional) and 旧小说
>>> (simplified characters for old fiction)"
>>>
>>> So, using the example provided above, we are dealing with materials
>>> produced in the 1950s-1970s that do even weirder things like:
>>>
>>> A. 舊小說
>>>
>>> can also be
>>>
>>> B. 旧小说 (all simplified)
>>> or
>>> C. 旧小說 (first character simplified, last character traditional)
>>> or
>>> D. 舊小说 (first character traditional, last character simplified)
>>>
>>> Thankfully the middle character was never simplified in recent times.
>>>
>>> From a historical standpoint, the mixed nature of the characters in the
>>> same word/phrase is because not all simplified characters were adopted
>>> at the same time by everyone uniformly (good times...).
>>>
>>> The problem seems to be that Solr can easily handle A or B above, but
>>> NOT C or D using the Smart Chinese analyzer. I'm not really sure how to
>>> change that at this point... maybe I should figure out how to contact
>>> the creators of the analyzer and ask them?
>>>
>>> Amanda
>>>
>>> ------
>>> Dr. Amanda Shuman
>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>> PhD, University of California, Santa Cruz
>>> http://www.amandashuman.net/
>>> http://www.prchistoryresources.org/
>>> Office: +49 (0) 761 203 4925
>>>
>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch
>>> <arafa...@gmail.com> wrote:
>>>
>>> > This is probably your start, if not read already:
>>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>> >
>>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>>> > (mentioned on the same page above)
>>> > Specifically, transformations:
>>> > http://userguide.icu-project.org/transforms/general
>>> >
>>> > With that, maybe you map both alphabets into Latin. I did that once
>>> > for Thai for a demo:
>>> > https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>> >
>>> > The challenge is to figure out all the magic rules for that. You'd
>>> > have to dig through the ICU documentation and other web pages. I found
>>> > this one for example:
>>> > http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
>>> >
>>> > There is also a 12-part series on Solr and Asian text processing,
>>> > though it is a bit old now: http://discovery-grindstone.blogspot.com/
>>> >
>>> > Hope one of these things helps.
>>> >
>>> > Regards,
>>> > Alex.
>>> >
>>> > On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com>
>>> > wrote:
>>> > > Hi all,
>>> > >
>>> > > We have a problem. Some of our historical documents have mixed
>>> > > together simplified and traditional Chinese characters. There seems
>>> > > to be no problem when searching either traditional or simplified
>>> > > separately - that is, if a particular string/phrase is all in
>>> > > traditional or simplified, it finds it - but it does not find the
>>> > > string/phrase if the two different characters (one traditional, one
>>> > > simplified) are mixed together in the SAME string/phrase.
>>> > >
>>> > > Has anyone ever handled this problem before? I know some libraries
>>> > > seem to have implemented something that seems to be able to handle
>>> > > this, but I'm not sure how they did so!
>>> > >
>>> > > Amanda