Have you tried CJKFoldingFilter (https://github.com/sul-dlss/CJKFoldingFilter)? I am not sure whether it covers your use case, but I have been using this filter and have had no issues so far.
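To illustrate why character folding helps with the mixed-script case discussed in this thread, here is a minimal Python sketch. It is not CJKFoldingFilter's actual code or mapping data: the two-entry table `TRAD_TO_SIMP` and the helper `fold` are hypothetical, and folding toward simplified is an arbitrary choice for the example.

```python
# Hypothetical, hand-written sample table: traditional -> simplified.
# A real filter ships a far larger mapping; this covers only the
# characters in the thread's example phrase.
TRAD_TO_SIMP = {
    "舊": "旧",
    "說": "说",
}

# Build the translation table once; characters not in it pass through.
_TABLE = str.maketrans(TRAD_TO_SIMP)


def fold(text: str) -> str:
    """Replace each traditional character found in the sample table
    with its simplified counterpart, character by character."""
    return text.translate(_TABLE)


# The four spellings of "old fiction" listed in the thread.
variants = {
    "A": "舊小說",  # all traditional
    "B": "旧小说",  # all simplified
    "C": "旧小說",  # first simplified, last traditional
    "D": "舊小说",  # first traditional, last simplified
}

# Fold every variant; because folding is per character, mixed
# spellings collapse to the same key as the pure ones.
folded = {label: fold(text) for label, text in variants.items()}

for label, key in folded.items():
    print(label, variants[label], "->", key)
```

If a folding step of this kind runs at both index time and query time, all four spellings should end up as the same term, so a query written one way can match a document written another way.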
Thanks

On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:

> Thanks, Alex - I have seen a few of those links but never considered
> transliteration! We use Lucene's Smart Chinese analyzer. The issue is
> basically what is laid out in the old blogspot post, namely this point:
>
> "Why approach CJK resource discovery differently?
>
> 2. Search results must be as script agnostic as possible.
>
> There is more than one way to write each word. "Simplified" characters were
> emphasized for printed materials in mainland China starting in the 1950s;
> "Traditional" characters were used in printed materials prior to the 1950s,
> and are still used in Taiwan, Hong Kong and Macau today.
> Since the characters are distinct, it's as if Chinese materials are written
> in two scripts.
> Another way to think about it: every written Chinese word has at least two
> completely different spellings. And it can be mix-n-match: a word can be
> written with one traditional and one simplified character.
> Example: Given a user query 舊小說 (traditional for old fiction), the
> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
> characters for old fiction)"
>
> So, using the example provided above, we are dealing with materials
> produced in the 1950s-1970s that do even weirder things like:
>
> A. 舊小說
>
> can also be
>
> B. 旧小说 (all simplified)
> or
> C. 旧小說 (first character simplified, last character traditional)
> or
> D. 舊小说 (first character traditional, last character simplified)
>
> Thankfully the middle character was never simplified in recent times.
>
> From a historical standpoint, the mixed nature of the characters in the
> same word/phrase is because not all simplified characters were adopted at
> the same time by everyone uniformly (good times...).
>
> The problem seems to be that Solr can easily handle A or B above, but NOT C
> or D using the Smart Chinese analyzer. I'm not really sure how to change
> that at this point...
> maybe I should figure out how to contact the creators
> of the analyzer and ask them?
>
> Amanda
>
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > This is probably your start, if not read already:
> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >
> > Otherwise, I think your answer would be somewhere around using ICU4J,
> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> > (mentioned on the same page above)
> > Specifically, transformations:
> > http://userguide.icu-project.org/transforms/general
> >
> > With that, maybe you map both alphabets into Latin. I did that once
> > for Thai for a demo:
> > https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> >
> > The challenge is to figure out all the magic rules for that. You'd
> > have to dig through the ICU documentation and other web pages. I found
> > this one for example:
> > http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
> >
> > There is also a 12-part series on Solr and Asian text processing, though
> > it is a bit old now: http://discovery-grindstone.blogspot.com/
> >
> > Hope one of these things helps.
> >
> > Regards,
> > Alex.
> >
> > On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > > Hi all,
> > >
> > > We have a problem. Some of our historical documents have mixed together
> > > simplified and traditional Chinese characters.
> > > There seems to be no problem when
> > > searching either traditional or simplified separately - that is, if a
> > > particular string/phrase is all in traditional or simplified, it finds it -
> > > but it does not find the string/phrase if the two different characters (one
> > > traditional, one simplified) are mixed together in the SAME string/phrase.
> > >
> > > Has anyone ever handled this problem before? I know some libraries seem to
> > > have implemented something that seems to be able to handle this, but I'm
> > > not sure how they did so!
> > >
> > > Amanda
> > > ------
> > > Dr. Amanda Shuman
> > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > <http://www.maoistlegacy.uni-freiburg.de/>
> > > PhD, University of California, Santa Cruz
> > > http://www.amandashuman.net/
> > > http://www.prchistoryresources.org/
> > > Office: +49 (0) 761 203 4925