Thanks, Alex - I have seen a few of those links but never considered transliteration! We use Lucene's Smart Chinese analyzer. The issue is basically what is laid out in the old blogspot post, namely this point:

"Why approach CJK resource discovery differently?

2. Search results must be as script agnostic as possible.

There is more than one way to write each word. "Simplified" characters were emphasized for printed materials in mainland China starting in the 1950s; "Traditional" characters were used in printed materials prior to the 1950s, and are still used in Taiwan, Hong Kong and Macau today. Since the characters are distinct, it's as if Chinese materials are written in two scripts. Another way to think about it: every written Chinese word has at least two completely different spellings. And it can be mix-n-match: a word can be written with one traditional and one simplified character.

Example: Given a user query 舊小說 (traditional for old fiction), the results should include matches for 舊小說 (traditional) and 旧小说 (simplified characters for old fiction)"

So, using the example provided above, we are dealing with materials produced in the 1950s-1970s that do even weirder things like:

A. 舊小說 (all traditional) can also be
B. 旧小说 (all simplified) or
C. 旧小說 (first character simplified, last character traditional) or
D. 舊小说 (first character traditional, last character simplified)

Thankfully the middle character was never simplified in recent times.

From a historical standpoint, the characters within the same word/phrase are mixed because not all simplified characters were adopted at the same time by everyone uniformly (good times...).

The problem seems to be that Solr, using the Smart Chinese analyzer, can easily handle A or B above, but NOT C or D. I'm not really sure how to change that at this point... maybe I should figure out how to contact the creators of the analyzer and ask them?

Amanda
------
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project
<http://www.maoistlegacy.uni-freiburg.de/>
PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925

On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> This is probably your start, if not read already:
> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>
> Otherwise, I think your answer would be somewhere around using ICU4J,
> IBM's library for dealing with Unicode: http://site.icu-project.org/
> (mentioned on the same page above)
> Specifically, transformations:
> http://userguide.icu-project.org/transforms/general
>
> With that, maybe you map both alphabets into latin. I did that once
> for Thai for a demo:
> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>
> The challenge is to figure out all the magic rules for that. You'd
> have to dig through the ICU documentation and other web pages. I found
> this one for example:
> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
>
> There is also 12 part series on Solr and Asian text processing, though
> it is a bit old now: http://discovery-grindstone.blogspot.com/
>
> Hope one of these things help.
>
> Regards,
> Alex.
>
>
> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > Hi all,
> >
> > We have a problem. Some of our historical documents have mixed together
> > simplified and traditional Chinese characters. There seems to be no
> > problem when searching either traditional or simplified separately -
> > that is, if a particular string/phrase is all in traditional or
> > simplified, it finds it - but it does not find the string/phrase if the
> > two different characters (one traditional, one simplified) are mixed
> > together in the SAME string/phrase.
> >
> > Has anyone ever handled this problem before? I know some libraries seem to
> > have implemented something that seems to be able to handle this, but I'm
> > not sure how they did so!
> >
> > Amanda
> > ------
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > <http://www.maoistlegacy.uni-freiburg.de/>
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 4925
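
To make the A/B/C/D case concrete, here is a minimal Python sketch of one possible approach: fold every character to a single script before matching, so that mixed-script spellings collapse to the same form. The two-entry table and the function name fold_to_simplified are hypothetical and only cover the 舊小說 example; a real setup would need a complete mapping. It may be that ICU's "Traditional-Simplified" transform, applied at both index and query time, could play that role in Solr, but that is an assumption to check against the Solr and ICU documentation, not something confirmed in this thread.

```python
# Hypothetical two-entry table covering only the example word.
# A real mapping would have thousands of entries (for instance from ICU).
TRAD_TO_SIMP = {"舊": "旧", "說": "说"}

# Build the translation table once; str.translate applies it per character.
_TABLE = str.maketrans(TRAD_TO_SIMP)


def fold_to_simplified(text: str) -> str:
    """Replace each traditional character found in the table.

    Characters without an entry (such as 小) are left unchanged.
    """
    return text.translate(_TABLE)


# The four spellings from the message above.
variants = {
    "A": "舊小說",  # all traditional
    "B": "旧小说",  # all simplified
    "C": "旧小說",  # simplified first, traditional last
    "D": "舊小说",  # traditional first, simplified last
}

folded = {label: fold_to_simplified(word) for label, word in variants.items()}

# After folding, all four variants are the same string, so a query in
# any one spelling would match documents written in any other.
all_match = len(set(folded.values())) == 1
```

The point of folding at the character level is that it does not care which characters in a word happen to be simplified, so cases C and D need no special handling.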