Re: invenio indexes CJK ?i

Tibor Simko Fri, 12 Jun 2009 16:33:09 +0200

On Wed, 10 Jun 2009, KU Kam-ming wrote:
> I can say 'YES' at the moment.  This is the simplest way to do CJK
> segmentation, but it is not lexically correct.  Good CJK segmentation
> requires AI-based language processing (or a dictionary-based approach),
> which takes time to implement.  In the meantime, it would be better to do
> as you say for CJK.


Thanks for the confirmation.  This task is now savannized:

   <https://savannah.cern.ch/task/index.php?10089>

Could you please prepare a sample CJK text so that we can test this?
The CDS Invenio demo site contains a collection called Poetry that is
well suited for this:

   <http://invenio-demo.cern.ch/collection/Poetry>

So, if you can send us an old and famous piece of classical Chinese
poetry in UTF-8, formatted like the Russian poem:

   <http://invenio-demo.cern.ch/record/75/export/xm>

     041__a ... language code
     100__a ... author of the poem
     245__a ... title of the poem
     520__a ... full body or an excerpt from the poem
     909C0y ... year of the poem

then we'll add it to the demo site and create some test cases for CJK
indexing.  (Please also list the typical logograms one might type when
searching for words in this poem.)
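
To make the requested format concrete, here is a minimal sketch of how
such a record could be assembled in Python.  The helper names
(field_spec_to_parts, build_record) are hypothetical and are not part of
Invenio.  The language code "chi" and the year value are placeholders,
and the MARCXML namespace and <collection> wrapper are left out for
brevity, so compare the output against the Russian record above before
sending anything.

```python
import xml.etree.ElementTree as ET


def field_spec_to_parts(spec):
    """Split a shorthand such as '909C0y' into (tag, ind1, ind2, code)."""
    tag, ind1, ind2, code = spec[:3], spec[3], spec[4], spec[5]
    # In this shorthand an underscore stands for a blank indicator.
    return tag, ind1.replace("_", " "), ind2.replace("_", " "), code


def build_record(fields):
    """Build a <record> element from (shorthand, value) pairs."""
    record = ET.Element("record")
    for spec, value in fields:
        tag, ind1, ind2, code = field_spec_to_parts(spec)
        datafield = ET.SubElement(
            record, "datafield", {"tag": tag, "ind1": ind1, "ind2": ind2}
        )
        subfield = ET.SubElement(datafield, "subfield", {"code": code})
        subfield.text = value
    return record


# Sample values: a well-known poem by Li Bai.  The language code and the
# year are placeholders (assumptions), to be replaced with checked values.
poem_fields = [
    ("041__a", "chi"),
    ("100__a", "李白"),
    ("245__a", "静夜思"),
    ("520__a", "床前明月光，疑是地上霜。举头望明月，低头思故乡。"),
    ("909C0y", "0000"),
]

record = build_record(poem_fields)
# Serialize as UTF-8 bytes, as requested in the message above.
xml_bytes = ET.tostring(record, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Keeping the field list as plain (shorthand, value) pairs mirrors the
notation used above, so adding further fields needs no code changes.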

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
