Re: Chinese sorting

Nils Knappmeier Fri, 19 Dec 2014 00:18:07 -0800

Hi Tomoko,

thank you for the detailed explanation and many thanks for trying out
the analyzer for me. I think "Very good compared to Unicode codepoint
based sorting" is good enough for me.


I will just try and use that Analyzer and see how it satisfies our customer.

Regards,
  Nils




On 18.12.2014 19:16, Tomoko Uchida wrote:

Yes, sorting Kanji is not as easy as sorting Hiragana/Katakana.

We simply expect collators to sort strings phonetically, regardless of
the script they are written in (Hiragana, Katakana, or Kanji).
However, a Kanji has multiple (usually 2 or 3) readings. We humans
naturally judge which reading is suitable depending on the situation.
That makes things difficult. Maybe an ideal collator should behave and
judge like a human.

Sorry for the long preamble.
I have tried ICUCollationKeyAnalyzer for Kanji and found it "not so bad".
Very good compared to Unicode codepoint based sorting, but far from perfect.
I don't fully know the algorithm it uses, but the accuracy might depend
heavily on the dictionaries/standards it has.

(Just an FYI,) Collators can take rules for adjustment.
http://userguide.icu-project.org/collation/api
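
To make "rules for adjustment" concrete, here is a minimal sketch. It uses the JDK's java.text.RuleBasedCollator as a stand-in, not ICU4J's com.ibm.icu.text.RuleBasedCollator, so the rule syntax that ICU accepts may differ in detail. The class name, the three sample Kanji, and the choice of a single reading per character are all assumptions made for illustration. A real rule set would need to cover far more characters, and it cannot resolve context-dependent readings.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class KanjiRuleSketch {
    // Toy rule set (hypothetical): order three Kanji by one common reading each.
    //   U+5DDD = kawa (river), U+7530 = ta (rice field), U+5C71 = yama (mountain)
    // Phonetic order is kawa < ta < yama; code point order would be
    // U+5C71 < U+5DDD < U+7530, i.e. yama < kawa < ta.
    static final String RULES = "< \u5DDD < \u7530 < \u5C71";

    // Returns a copy of the input sorted with the rule-based collator.
    static List<String> sortByRules(List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            List<String> sorted = new ArrayList<>(words);
            Collections.sort(sorted, collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("\u5C71", "\u7530", "\u5DDD");
        System.out.println(sortByRules(words));
    }
}
```

Characters that the rules do not mention are still sorted, but only after the listed ones, so a rule string like this is only useful as an adjustment on top of a complete base ordering.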

Regards,
Tomoko




2014-12-18 18:19 GMT+09:00 Nils Knappmeier <n.knappme...@i-views.de>:

Hi Tomoko,

does sorting with Locale.JAPANESE also work for Kanji? Since Hiragana and
Katakana are based on phonetics, I guess it is easier to define a
sorting order for them. But Kanji is more similar to Chinese.

Thanks,
   Nils


On 17.12.2014 17:01, Tomoko Uchida wrote:

Hi, Nils,

I don't know Chinese at all... but collation is very important in Japanese
too.
Lucene has the org.apache.lucene.collation package, which uses ICU4J's
collators (you can find "lucene-analyzers-icu-4.10.2.jar" in the
analysis/icu directory).
http://lucene.apache.org/core/4_10_2/analyzers-icu/index.html?org/apache/lucene/collation/package-summary.html

ICU4J also supports Chinese, of course.
http://site.icu-project.org/charts/collation-icu4j-sun

I wrote a test program using ICUCollationKeyAnalyzer; it works well for
Japanese Hiragana/Katakana.
Here is a code snippet.

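// Collator here is com.ibm.icu.text.Collator (ICU4J), not java.text.Collator.
// "dir" is an already opened org.apache.lucene.store.Directory.
Analyzer collationAnalyzer = new ICUCollationKeyAnalyzer(
    Version.LUCENE_4_10_2, Collator.getInstance(Locale.JAPANESE));
IndexWriter writer = new IndexWriter(
    dir, new IndexWriterConfig(Version.LUCENE_4_10_2, collationAnalyzer));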
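
The analyzer in the snippet above indexes each field value as a collation key, so that plain byte-wise term order matches the collator's order. The sketch below illustrates only that idea, using the JDK's java.text.Collator as a stand-in for the ICU4J collator. The class and method names are hypothetical, and the actual key bytes ICU produces will differ.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationKeySketch {
    // JDK collator used as a stand-in (assumption); the thread uses ICU4J's.
    static final Collator COLLATOR = Collator.getInstance(Locale.ENGLISH);

    // Binary sort key for a string.
    static byte[] sortKey(String s) {
        return COLLATOR.getCollationKey(s).toByteArray();
    }

    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    static int compareUnsigned(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    // Compares two strings by their key bytes only.
    static int compareByKey(String a, String b) {
        return compareUnsigned(sortKey(a), sortKey(b));
    }

    public static void main(String[] args) {
        // Raw string comparison puts "B" before "a"; the collator and the
        // key bytes both put "a" before "B".
        System.out.println("raw:      " + Integer.signum("a".compareTo("B")));
        System.out.println("collator: " + Integer.signum(COLLATOR.compare("a", "B")));
        System.out.println("key:      " + Integer.signum(compareByKey("a", "B")));
    }
}
```

Because the ordering is baked into the indexed bytes, the collator used at index time determines the sort order; changing the locale or rules later means reindexing.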

I understand collation is a very difficult problem, so I am not sure this
works for you...
I would appreciate it if you shared your trials/research.

Regards,
Tomoko

2014-12-17 20:54 GMT+09:00 Nils Knappmeier <n.knappme...@i-views.de>:

Hi,

is there any implementation of a Chinese collator in Lucene? I've seen
that there is a Chinese analyzer which uses Hidden Markov Models. But
sorting seems to be an issue of its own, and all my googling hasn't led to
any results yet.

I understand that this is not a trivial issue, and I've read that the
Chinese tend to prefer orderings other than by name, since sorting orders
are so complicated that nobody wants to use them. But we will have to sort
search results by name, even though the name is Chinese (Simplified Chinese
at the moment, but Traditional may also appear later), and currently
Chinese words seem to be ordered by their Unicode code point, which does
not seem to be the right order.
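
A minimal sketch of the problem described above, using only the JDK. The three sample characters and their Pinyin readings are chosen for illustration, and the class and method names are hypothetical. The first method reproduces the code point ordering. The second hands the same list to java.text.Collator for a given locale; what order that produces depends on the collation data shipped with the runtime, so no result is claimed for it here.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CodePointOrderSketch {
    // Sample characters with their usual Pinyin readings:
    //   U+963F = a, U+5317 = bei, U+4E2D = zhong
    static final List<String> NAMES = Arrays.asList("\u963F", "\u5317", "\u4E2D");

    // Natural String order: UTF-16 code unit order, which equals code point
    // order for these characters. Yields zhong, bei, a (reverse of Pinyin).
    static List<String> sortByCodeUnit(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted);
        return sorted;
    }

    // Locale-aware order; the result depends on the runtime's collation data.
    static List<String> sortByCollator(List<String> names, Locale locale) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted, Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println("code point: " + sortByCodeUnit(NAMES));
        System.out.println("collator:   " + sortByCollator(NAMES, Locale.CHINA));
    }
}
```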

Thanks in advance for any suggestion,
   Nils
