On Sat, 06 Jun 2009, KU Kam-ming wrote:
> Could Invenio index Chinese (multi-byte) characters? It seems that
> bibindex will segment a string by space, however, this is not
> applicable for CJK.

There is no problem in having multi-byte UTF-8 characters in Invenio,
but indeed, as you said, Invenio's default word-breaking procedures are
not CJK-friendly. Phrase search and regexp search would be the only
usable matching options left in such a setup.

That said, it is possible to customize the word-breaking procedures in
Invenio's workflow. Can you suggest a well-working, CJK-savvy library
that would split phrases into words? Preferably in Python or C?

Best regards
--
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
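
P.S. Until a proper segmentation library is chosen, one library-free
fallback often used for CJK indexing is to emit overlapping character
bigrams for each run of Han characters, while still splitting other text
into ordinary words. Below is a minimal sketch of that idea in Python 3.
It is not Invenio code and does not use any bibindex API: the function
name `tokenize` is hypothetical, and the character range is an assumption
that covers only the basic CJK Unified Ideographs block (no kana, hangul
or extension blocks).

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is treated as CJK.
CJK = "\u4e00-\u9fff"

# A token is either a run of CJK characters or a run of non-CJK word characters.
TOKEN_RE = re.compile(f"[{CJK}]+|[^\\W{CJK}]+")
CJK_RE = re.compile(f"[{CJK}]")


def tokenize(text):
    """Hypothetical word breaker: bigrams for CJK runs, plain words otherwise."""
    tokens = []
    for run in TOKEN_RE.findall(text):
        if CJK_RE.match(run):
            if len(run) == 1:
                # A lone CJK character is kept as its own token.
                tokens.append(run)
            else:
                # Overlapping bigrams: ABCD -> AB, BC, CD.
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            # Non-CJK words are lowercased, as a space-based breaker might do.
            tokens.append(run.lower())
    return tokens


if __name__ == "__main__":
    print(tokenize("Invenio 中文检索 test"))
```

Bigrams need no dictionary, but they inflate the index and can match
across real word boundaries, so a true segmenter would still be
preferable where one is available.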
