On Wed, 10 Jun 2009, KU Kam-ming wrote:
> I can say 'YES' at the moment. This is the simplest way to do CJK
> segmentation, but it is not lexically correct. Good CJK segmentation
> involves AI with language processing (or dictionary based). It takes time
> to do so. In the meantime, it would be better to do as what you say for
> CJK.
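
For illustration only, here is a minimal sketch of what such a simple segmentation might look like. It assumes that "the simplest way" means indexing every CJK character as its own term while splitting other text on non-alphanumeric characters; the thread does not spell out the exact rule. The names `CJK_RANGES`, `is_cjk` and `segment` are hypothetical and not taken from CDS Invenio, and the code point ranges are a rough, non-exhaustive selection.

```python
# Assumed, non-exhaustive set of CJK code point ranges (illustrative only).
CJK_RANGES = [
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def is_cjk(ch):
    """Return True if the character falls in one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def segment(text):
    """Split text into index terms: one term per CJK character,
    and runs of alphanumeric characters for everything else."""
    tokens = []
    word = []
    for ch in text:
        if is_cjk(ch):
            # Flush any pending non-CJK word, then emit the character alone.
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        else:
            # Whitespace and punctuation act as separators.
            if word:
                tokens.append("".join(word))
                word = []
    if word:
        tokens.append("".join(word))
    return tokens


if __name__ == "__main__":
    print(segment("中文 text 検索"))
```

Such per-character terms are not lexically correct words, as noted above, but they need no dictionary and let a query match any character sequence that occurs in the text.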

Thanks for the confirmation. This task is now savannized:

   <https://savannah.cern.ch/task/index.php?10089>

Can you please prepare a sample CJK text so that we can test this?
The CDS Invenio demo site contains a collection called Poetry that is
well suited for this:

   <http://invenio-demo.cern.ch/collection/Poetry>

So, if you can send us an old and famous piece of classical Chinese
poetry in UTF-8, formatted like the Russian poem:

   <http://invenio-demo.cern.ch/record/75/export/xm>

   041__a ... language code
   100__a ... author of the poem
   245__a ... title of the poem
   520__a ... full body or an excerpt from the poem
   909C0y ... year of the poem

then we'll add it to the demo site and create some test cases for CJK
indexing. (Please also add the typical logograms one might input when
searching for words in this poem.)

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
