Hi Nils,

I don't know Chinese at all, but collation is very important in Japanese too.

Lucene has the org.apache.lucene.collation package, which uses ICU4J's collators (you can find "lucene-analyzers-icu-4.10.2.jar" in the analysis/icu directory).
http://lucene.apache.org/core/4_10_2/analyzers-icu/index.html?org/apache/lucene/collation/package-summary.html
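As I understand it, the collation analyzers turn each field value into its collation key, so that index terms sort in collator order rather than by Unicode number. Here is a minimal sketch of that idea. It uses the JDK's java.text.Collator as a stand-in for ICU4J's collator so that it runs without extra jars. The class name CollationKeySketch and the sample names are made up for illustration, and the order you actually get for Chinese depends on the collator's rules, so I make no claim about the exact output.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: compares plain String order with collation-key order.
public class CollationKeySketch {

    // Plain String ordering (UTF-16 code units), i.e. ordering "by Unicode number".
    static List<String> sortByCodeUnit(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(null); // null comparator = natural String ordering
        return copy;
    }

    // Ordering by collation key: each string is converted to a key once,
    // and the keys are then compared with each other.
    static List<String> sortByCollationKey(List<String> names, Collator collator) {
        List<CollationKey> keys = new ArrayList<>();
        for (String name : names) {
            keys.add(collator.getCollationKey(name));
        }
        keys.sort(null); // CollationKey is Comparable; its order matches the collator
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up sample names, only for illustration.
        List<String> names = Arrays.asList("张伟", "王芳", "李娜", "刘洋");
        // Assumption: the JDK collator for this locale stands in for ICU4J's.
        Collator collator = Collator.getInstance(Locale.SIMPLIFIED_CHINESE);

        System.out.println("by code unit:     " + sortByCodeUnit(names));
        System.out.println("by collation key: " + sortByCollationKey(names, collator));
    }
}
```

If the two printed lines differ, the second one is the order the collator defines. Swapping in a different Collator instance changes only the second result.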
ICU4J also supports Chinese, of course.
http://site.icu-project.org/charts/collation-icu4j-sun

I wrote a test program using ICUCollationKeyAnalyzer, and it works well with Japanese Hiragana/Katakana. Here is a code snippet.

    // Collator here is ICU4J's com.ibm.icu.text.Collator, not java.text.Collator
    Analyzer collationAnalyzer = new ICUCollationKeyAnalyzer(
        Version.LUCENE_4_10_2, Collator.getInstance(Locale.JAPANESE));
    IndexWriter writer = new IndexWriter(
        dir, new IndexWriterConfig(Version.LUCENE_4_10_2, collationAnalyzer));

I understand that collation is a very difficult problem, so I am not sure this will work for you. I would appreciate it if you shared the results of your trials/research.

Regards,
Tomoko

2014-12-17 20:54 GMT+09:00 Nils Knappmeier <n.knappme...@i-views.de>:

> Hi,
>
> is there any implementation of a Chinese collator in Lucene? I've seen
> that there is a Chinese analyzer which uses Hidden Markov Models. But
> sorting seems to be an issue of its own, and all my googling hasn't led to
> any results yet.
>
> I understand that this is not a trivial issue, and I've read that the
> Chinese tend to prefer orderings other than by name, since sorting orders
> are so complicated that nobody wants to use them. But we will have to sort
> search results by name, even though the name is Chinese (Simplified Chinese
> at the moment, but Traditional may also appear later), and currently Chinese
> words seem to be ordered by their Unicode number, which does not seem to be
> the right order.
>
> Thanks in advance for any suggestion,
> Nils