I don't know Chinese myself, but if you want, I can run a quick test for you in the Analysis tab if you tell me what to put in the index and query windows...
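
For what it's worth, here is a minimal Python sketch of the folding idea, i.e. mapping every character to one script before comparing. It is only an illustration: the `TRAD_TO_SIMP` table and the `fold` function are hypothetical names I made up, and the table covers just the two characters from Amanda's example. In Solr this would normally be done by a filter in the analysis chain (a folding filter or an ICU Traditional-Simplified transform), not by hand.

```python
# Hypothetical, hand-written table covering only the characters in the
# example; a real folding filter ships a far larger mapping.
TRAD_TO_SIMP = {"舊": "旧", "說": "说"}


def fold(text: str) -> str:
    """Replace each traditional character with its simplified form.

    Characters without an entry in the table are passed through unchanged.
    """
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


# The four spellings from the thread: all traditional, all simplified,
# and the two mixed forms.
variants = ["舊小說", "旧小说", "旧小說", "舊小说"]

# After folding, every variant collapses to the same string, so a query in
# any spelling would match a document in any other spelling.
folded = {fold(v) for v in variants}
print(folded)  # {'旧小说'}
```

If the same folding is applied at both index time and query time, the mixed forms (C and D) stop being a special case.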

On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:

> Have you tried the CJKFoldingFilter
> (https://github.com/sul-dlss/CJKFoldingFilter)? I am not sure whether it
> would cover your use case, but I am using this filter and so far have had
> no issues.
>
> Thnx
>
> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com>
> wrote:
>
>> Thanks, Alex - I have seen a few of those links but never considered
>> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
>> basically what is laid out in the old blogspot post, namely this point:
>>
>> "Why approach CJK resource discovery differently?
>>
>> 2. Search results must be as script agnostic as possible.
>>
>> There is more than one way to write each word. "Simplified" characters
>> were emphasized for printed materials in mainland China starting in the
>> 1950s; "Traditional" characters were used in printed materials prior to
>> the 1950s, and are still used in Taiwan, Hong Kong and Macau today.
>> Since the characters are distinct, it's as if Chinese materials are
>> written in two scripts.
>> Another way to think about it: every written Chinese word has at least
>> two completely different spellings. And it can be mix-n-match: a word can
>> be written with one traditional and one simplified character.
>> Example: Given a user query 舊小說 (traditional for old fiction), the
>> results should include matches for 舊小說 (traditional) and 旧小说
>> (simplified characters for old fiction)"
>>
>> So, using the example provided above, we are dealing with materials
>> produced in the 1950s-1970s that do even weirder things like:
>>
>> A. 舊小說
>>
>> can also be
>>
>> B. 旧小说 (all simplified)
>> or
>> C. 旧小說 (first character simplified, last character traditional)
>> or
>> D. 舊小说 (first character traditional, last character simplified)
>>
>> Thankfully the middle character was never simplified in recent times.
>>
>> From a historical standpoint, the mixed nature of the characters in the
>> same word/phrase is because not all simplified characters were adopted at
>> the same time by everyone uniformly (good times...).
>>
>> The problem seems to be that Solr can easily handle A or B above, but NOT
>> C or D using the Smart Chinese analyzer. I'm not really sure how to change
>> that at this point... maybe I should figure out how to contact the
>> creators of the analyzer and ask them?
>>
>> Amanda
>>
>> ------
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> <http://www.maoistlegacy.uni-freiburg.de/>
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>>
>>
>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch
>> <arafa...@gmail.com> wrote:
>>
>> > This is probably your start, if not read already:
>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>> >
>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>> > (mentioned on the same page above)
>> > Specifically, transformations:
>> > http://userguide.icu-project.org/transforms/general
>> >
>> > With that, maybe you map both alphabets into Latin. I did that once
>> > for Thai for a demo:
>> > https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>> >
>> > The challenge is to figure out all the magic rules for that. You'd
>> > have to dig through the ICU documentation and other web pages. I found
>> > this one for example:
>> > http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
>> >
>> > There is also a 12-part series on Solr and Asian text processing,
>> > though it is a bit old now: http://discovery-grindstone.blogspot.com/
>> >
>> > Hope one of these things helps.
>> >
>> > Regards,
>> > Alex.
>> >
>> >
>> > On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com>
>> > wrote:
>> > > Hi all,
>> > >
>> > > We have a problem. Some of our historical documents mix simplified
>> > > and traditional Chinese characters. There seems to be no problem when
>> > > searching either traditional or simplified separately - that is, if a
>> > > particular string/phrase is all in traditional or all in simplified,
>> > > it finds it - but it does not find the string/phrase if the two
>> > > different kinds of characters (one traditional, one simplified) are
>> > > mixed together in the SAME string/phrase.
>> > >
>> > > Has anyone ever handled this problem before? I know some libraries
>> > > seem to have implemented something that is able to handle this, but
>> > > I'm not sure how they did so!
>> > >
>> > > Amanda
>> > > ------
>> > > Dr. Amanda Shuman
>> > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> > > <http://www.maoistlegacy.uni-freiburg.de/>
>> > > PhD, University of California, Santa Cruz
>> > > http://www.amandashuman.net/
>> > > http://www.prchistoryresources.org/
>> > > Office: +49 (0) 761 203 4925