Yes, sorting Kanji is not as easy as sorting Hiragana/Katakana. We simply expect collators to sort strings phonetically, regardless of the script they are written in (Hiragana, Katakana, Kanji). However, a Kanji character has multiple (usually 2 or 3) readings. We humans naturally judge which reading is suitable depending on the context, and that is what makes things difficult. Maybe an ideal collator should behave and judge like a human.
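To make the "phonetic order regardless of script" expectation concrete, here is a toy sketch. It uses the JDK's java.text.RuleBasedCollator instead of ICU4J so that it runs without extra jars. The rule string, the class name and the method names are all made up for this illustration and cover only six kana (a, ka, sa in both scripts). It is not how ICU or Lucene actually orders Japanese, and it does nothing for Kanji.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class KanaCollationSketch {

    // Hypothetical toy rules (an assumption for this example, not ICU's data).
    // '<' marks a primary difference, ',' a tertiary one, so each Hiragana
    // letter sorts just before its Katakana counterpart:
    //   a (U+3042, U+30A2) < ka (U+304B, U+30AB) < sa (U+3055, U+30B5)
    static final String TOY_RULES =
            "< \u3042 , \u30A2 < \u304B , \u30AB < \u3055 , \u30B5";

    // String's natural order compares UTF-16 code units, which equals
    // code point order for these BMP characters.
    static List<String> sortByCodePoint(List<String> input) {
        List<String> out = new ArrayList<>(input);
        out.sort(Comparator.naturalOrder());
        return out;
    }

    // Sorts with a collator built only from the toy rules above.
    static List<String> sortByToyRules(List<String> input) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(TOY_RULES);
            List<String> out = new ArrayList<>(input);
            out.sort(collator);
            return out;
        } catch (ParseException e) {
            throw new IllegalStateException("toy rules failed to parse", e);
        }
    }

    public static void main(String[] args) {
        // sa (Hiragana), ka (Katakana), a (Hiragana),
        // a (Katakana), ka (Hiragana), sa (Katakana)
        List<String> kana = Arrays.asList(
                "\u3055", "\u30AB", "\u3042", "\u30A2", "\u304B", "\u30B5");
        System.out.println("code point order: " + sortByCodePoint(kana));
        System.out.println("toy rule order:   " + sortByToyRules(kana));
    }
}
```

Code point order puts all Hiragana before all Katakana, while the toy rules interleave the two scripts by sound (a, ka, sa). A Kanji string cannot be handled this way, because the rules would have to pick one reading per character.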
Sorry for the long preamble. I have tried ICUCollationKeyAnalyzer with Kanji and found the results "not so bad": much better than sorting by Unicode code point, but far from perfect. I don't fully know the algorithm ICU uses, but the accuracy probably depends heavily on the dictionaries/standards it relies on.

(Just an FYI:) collators can take rules for adjustment.
http://userguide.icu-project.org/collation/api

Regards,
Tomoko

2014-12-18 18:19 GMT+09:00 Nils Knappmeier <n.knappme...@i-views.de>:

> Hi Tomoko,
>
> does sorting with Locale.JAPANESE also work for Kanji. Since Hiragana and
> Katakana are based on the phonetics, I guess it is easier to define a
> sorting order. But Kanji is more similar to the Chinese.
>
> Thanks,
> Nils
>
>
> On 17.12.2014 17:01, Tomoko Uchida wrote:
>
>> Hi, Nils,
>>
>> I don't know Chinese at all... but collation is very important in Japanese
>> too.
>> Lucene has org.apache.lucene.collation package that use ICU4J's collators
>> (you can find "lucene-analyzers-icu-4.10.2.jar" in analysis/icu
>> directory).
>> http://lucene.apache.org/core/4_10_2/analyzers-icu/index.html?org/apache/lucene/collation/package-summary.html
>>
>> ICU4J also supports Chinese, of course.
>> http://site.icu-project.org/charts/collation-icu4j-sun
>>
>> I wrote a test program using ICUCollationKeyAnalyzer, it works well in
>> Japanese Hiragana/Katakana.
>> Here is a code snippet.
>>
>> Analyzer collationAnalyzer = new
>> ICUCollationKeyAnalyzer(Version.LUCENE_4_10_2,
>> Collator.getInstance(Locale.JAPANESE));
>> IndexWriter writer = new IndexWriter(dir, new
>> IndexWriterConfig(Version.LUCENE_4_10_2, collationAnalyzer));
>>
>> I understand collation is a very difficult problem, so I am not sure this
>> works for you...
>> I would appreciate if you share your trial/research.
>>
>> Regards,
>> Tomoko
>>
>> 2014-12-17 20:54 GMT+09:00 Nils Knappmeier <n.knappme...@i-views.de>:
>>
>>> Hi,
>>>
>>> is there any implementation for a chinese collator in Lucene. I've seen
>>> that there is a chinese analyzer which uses Hidden Markov Models. But
>>> sorting seems to be an issue on its own and all my googling hasn't led to
>>> any results yet.
>>>
>>> I understand that this is not a trivial issue and I've read that the
>>> chinese tend to prefer other ordering than by name, since sorting orders
>>> are so complicated that nobody wants to use them. But we will have to
>>> sort search results by name, even though the name is chinese (simplified
>>> chinese at the moment, but traditional may also appear later) and
>>> currently chinese words seem to be ordered by their unicode-number,
>>> which seems not to be the right order.
>>>
>>> Thanks in advance for any suggestion,
>>> Nils