Re: invenio indexes CJK ?

Tibor Simko Tue, 9 Jun 2009 14:27:15 +0200

On Sat, 06 Jun 2009, KU Kam-ming wrote:
> Could Invenio index Chinese (multi-byte) characters?  It seems that
> bibindex will segment a string by space, however, this is not
> applicable for CJK.


Invenio has no problem storing multi-byte UTF-8 characters; but indeed,
as you said, its default word-breaking procedures are not CJK-friendly.
Phrase search and regexp search would be the only usable matching
options left in such a setup.
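
To illustrate why space-based word breaking falls short here, a minimal
sketch follows.  It is not Invenio's bibindex code; the helper name
`whitespace_words` and the sample strings are assumptions made up for
the example.

```python
# Illustration only: a plain whitespace split standing in for a
# space-based word breaker.  This is NOT Invenio's bibindex code.
def whitespace_words(text):
    """Split text into words on runs of whitespace."""
    return text.split()

english = "digital library software"
chinese = "数字图书馆软件"  # written without spaces between words

english_words = whitespace_words(english)  # one entry per word
chinese_words = whitespace_words(chinese)  # the whole string as one "word"
```

With the whole string indexed as a single term, a word search for only
part of it would find nothing, which is why phrase or regexp matching
remains the fallback.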

That said, it is possible to customize the word-breaking procedures in
Invenio's workflow.  Can you suggest a well-working, CJK-savvy library
that splits phrases into words?  Preferably in Python or C?
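
As a rough idea of what a customized word breaker could look like when
no dictionary-based segmenter is at hand, here is a sketch using
overlapping character bigrams for CJK runs.  This is a generic fallback
technique, not a true word segmenter and not an existing Invenio hook;
the function name `bigram_words` and the chosen character ranges are
assumptions and far from exhaustive.

```python
import re

# Rough, non-exhaustive CJK ranges (an assumption for this sketch):
# CJK Unified Ideographs, Hiragana/Katakana, Hangul syllables.
CJK = "\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af"

CJK_RUN = re.compile("[" + CJK + "]+")
# A token is either a run of CJK characters or a run of other
# non-whitespace characters.
TOKEN = re.compile("[" + CJK + "]+|" + r"[^\s" + CJK + "]+")

def bigram_words(text):
    """Hypothetical word breaker: whitespace split for non-CJK text,
    overlapping two-character terms for CJK runs."""
    tokens = []
    for match in TOKEN.finditer(text):
        chunk = match.group(0)
        if CJK_RUN.fullmatch(chunk) and len(chunk) > 1:
            # Emit every adjacent pair of characters in the run.
            tokens.extend(chunk[i:i + 2] for i in range(len(chunk) - 1))
        else:
            # Non-CJK word, or a single isolated CJK character.
            tokens.append(chunk)
    return tokens

mixed_terms = bigram_words("Invenio 数字图书馆")
```

For such a scheme to work, the same breaking would have to be applied
to the user's query at search time, so that query terms and index terms
line up.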

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
