In two messages around Jan 30, 2012, Jeff Fearn wrote:
> Collating the three kana scripts of Japanese properly is the Mt
> Everest of this challenge.
>
> [...]
>
> Collating each of them separately is easy, but it's perfectly valid
> in Japanese to mix them so you have to be able to collate all of
> them together. AFAIK no one has done that in any open source
> project.

Apparently, the mapping from a string of kanji to its pronunciation (which is what determines its ordering) isn't even deterministic, at least for proper names. (The example I came across is that the woman's name 角田 純子 has at least four possible readings of the family name times two possible readings of the given name.) Thus, the solution would have to involve supplying pronunciations somehow for at least some glossary entries.

Once pronunciations (in katakana or hiragana) are available for all the glossary entries, the Lingua::JA::Sort::JIS Perl module can be used to do the JIS X 4061:1996 collation among them.

Really, the problem would benefit from Japanese input on how it is usually solved. The Japanese translators might be able to help there, at least as to how they supply pronunciations to other computer software that needs to know sorting order.

(Btw, if anyone was going to try looking up JIS X 4061:1996, then unfortunately it looks like it's only available for a fee and in Japanese:

  http://www.webstore.jsa.or.jp/webstore/Com/FlowControl.jsp?lang=en&bunsyoId=JIS+X+4061%3A1996&dantaiCd=JIS&status=1&pageNo=6

However, I'm told that the Japanese Wikipedia article

  http://ja.wikipedia.org/wiki/日本語文字列照合順番

has an overview. The Google translation of that page is challenging to read, though:

  http://translate.google.com/translate?sl=ja&tl=en&u=http%3A%2F%2Fja.wikipedia.org%2Fwiki%2F%E6%97%A5%E6%9C%AC%E8%AA%9E%E6%96%87%E5%AD%97%E5%88%97%E7%85%A7%E5%90%88%E9%A0%86%E7%95%AA

.)

pjrm.

_______________________________________________
publican-list mailing list
publican-list@redhat.com
https://www.redhat.com/mailman/listinfo/publican-list
Wiki: https://fedorahosted.org/publican
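
For illustration only, here is a minimal sketch of the "supply a reading for each glossary entry, then sort on the reading" idea from the message above. It is written in Python rather than Perl so as not to guess at the Lingua::JA::Sort::JIS interface, and it is not an implementation of JIS X 4061:1996: it only folds katakana to hiragana and compares by code point, which approximates gojuon order for plain kana but ignores the standard's finer rules. The glossary data, the two readings shown for 角田, and the names `kata_to_hira` and `sort_key` are all assumptions made up for the example.

```python
import unicodedata


def kata_to_hira(text):
    """Fold katakana (U+30A1..U+30F6) onto hiragana by subtracting 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )


def sort_key(entry):
    """Build a sort key from a (term, reading) pair.

    Entries are ordered by their supplied reading, not by the kanji.
    NFKC turns half-width katakana into full-width before folding.
    The term itself is used only as a tie-breaker.
    """
    term, reading = entry
    normalized = kata_to_hira(unicodedata.normalize("NFKC", reading))
    return (normalized, term)


# Hypothetical glossary: each term carries an explicitly supplied reading,
# because the reading cannot be derived reliably from the kanji alone.
glossary = [
    ("角田", "ツノダ"),   # one illustrative reading of the family name
    ("角田", "かくた"),   # another illustrative reading of the same kanji
    ("漢字", "かんじ"),
    ("仮名", "カナ"),
]

sorted_glossary = sorted(glossary, key=sort_key)

for term, reading in sorted_glossary:
    print(term, reading)
```

The point of the sketch is that the same kanji string (角田) ends up in two different places depending on the reading supplied, and that readings given in katakana and hiragana can be mixed. A real solution would replace the code point comparison with a proper JIS X 4061 collation.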