Dear Christian, 

I’d be happy to chime in on the quality of BaseX’s Chinese-language full-text
capabilities. Chinese sources are my primary research area. What exactly do you
have in mind?

Greetings
Duncan

Ceterum censeo exist-db.org esse corrigendam



> 
> Message: 1
> Date: Wed, 14 Oct 2020 12:30:59 +0200
> From: Philippe Pons <philippe.p...@college-de-france.fr>
> To: basex-talk@mailman.uni-konstanz.de
> Subject: Re: [basex-talk] stemming chinese texts
> 
> Hi Christian,
> 
> I suppose some of my colleagues would be able to judge the quality of 
> your full-text search results.
> 
> On the other hand, at the code level, I'm not sure I know how to implement 
> an additional class that extends the abstract Tokenizer class.
> 
> Thank you for your help
> Philippe
> 
> 
> On 14/10/2020 at 11:00, Christian Grün wrote:
>> Hi Philippe,
>> 
>> Thanks for your mail in private, in which I already gave you a little
>> assessment on what might be necessary to include the CJK tokenizers in
>> BaseX:
>> 
>> The existing Apache code can be adapted and embedded into the BaseX
>> tokenizer infrastructure. At the code level, an additional class needs to
>> be implemented that extends the abstract Tokenizer class [1].
>> 
>> As far as I can judge, the three Lucene CJK analyzers could all be applied
>> to traditional and simplified Chinese. If we found someone who could
>> rate the linguistic quality of our full-text search results, that’d
>> surely be helpful.
>> 
>> Hope this helps,
>> Christian
>> 
>> [1] 
>> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/ft/Tokenizer.java
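The core idea behind Lucene's `cjk` analyzer package is to index runs of CJK characters as overlapping character bigrams, since Chinese is written without spaces between words. A minimal, self-contained Java sketch of that bigram strategy (a simplified illustration only: real Lucene analyzers also handle isolated CJK characters and non-CJK runs, and this is not the BaseX Tokenizer API referenced in [1]):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of CJK bigram tokenization: every pair of adjacent
// Han characters becomes one index token. Not the BaseX Tokenizer API.
public class CjkBigrams {

    // True if the code point belongs to the Han script (Chinese characters).
    static boolean isHan(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
    }

    // Emit overlapping bigrams for consecutive Han characters;
    // non-Han characters break the run and produce no tokens here.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        for (int i = 0; i < cps.length - 1; i++) {
            if (isHan(cps[i]) && isHan(cps[i + 1])) {
                tokens.add(new String(cps, i, 2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "中文全文检索" yields the bigrams 中文, 文全, 全文, 文检, 检索
        System.out.println(tokenize("中文全文检索"));
    }
}
```

Because bigrams need no dictionary, the same strategy works for both traditional and simplified Chinese, at the cost of some false matches compared to dictionary-based segmentation.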
>> 
>> 
>> 
>> On Tue, Oct 13, 2020 at 12:32 PM Philippe Pons
>> <philippe.p...@college-de-france.fr> wrote:
>>> Dear Christian,
>>> 
>>> Thank you very much for this quick and enlightening response.
>>> 
>>> Without having had the opportunity to test it yet, I have indeed read about
>>> the Japanese text tokenizer.
>>> Supporting Chinese tokenization would also be a great help.
>>> 
>>> I have never tested what Lucene offers, especially since I have to manage 
>>> texts in traditional Chinese and simplified Chinese (without reading either 
>>> one myself).
>>> I would like to test Lucene's analyzers, but I don't know how to do that in 
>>> BaseX.
>>> 
>>> Best regards,
>>> Philippe Pons
>>> 
>>> 
>>> 
>>> On 12/10/2020 at 12:01, Christian Grün wrote:
>>> 
>>> Dear Philippe,
>>> 
>>> As the Chinese language rarely uses inflection, there is usually no
>>> need to perform stemming on texts. However, tokenization will be
>>> necessary indeed. Right now, BaseX provides no tokenizer/analyzer for
>>> Chinese texts. It should be possible indeed to adopt code from Lucene,
>>> as we’ve already done for other languages (our software licenses allow
>>> that).
>>> 
>>> Have you already worked with tokenization of Chinese texts in Lucene?
>>> If yes, which of the three available analyzers [1] have proven to yield
>>> the best results?
>>> 
>>> As you may know, one of our users, Toshio HIRAI, has contributed a
>>> tokenizer for Japanese texts in the past [2]. If we decide to include
>>> support for Chinese tokenization, it might also be interesting to
>>> compare the results of the Apache tokenizer with our internal
>>> tokenizer.
>>> 
>>> Best regards,
>>> Christian
>>> 
>>> [1] 
>>> https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
>>> [2] https://docs.basex.org/wiki/Full-Text:_Japanese
>>> 
>>> 
>>> 
>>> On Mon, Oct 12, 2020 at 11:37 AM Philippe Pons
>>> <philippe.p...@college-de-france.fr> wrote:
>>> 
>>> Dear BaseX Team,
>>> 
>>> I'm currently working on Chinese texts in TEI.
>>> I would like to know whether stemming Chinese text is possible in BaseX, as 
>>> we can do for other languages (like English or German).
>>> Or maybe there is a way to add this functionality with Lucene?
>>> 
>>> Best regards,
>>> Philippe Pons
>>> 
>>> --
>>> Research engineer in charge of digital corpus editions
>>> Centre de recherche sur les civilisations de l'Asie Orientale
>>> CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University, 
>>> Univ Paris Diderot, Sorbonne Paris Cité)
>>> 49bis avenue de la Belle Gabrielle
>>> 75012 Paris
>>> https://cv.archives-ouvertes.fr/ppons
>>> 
>>> 
> 
> 
> End of BaseX-Talk Digest, Vol 130, Issue 8
> ******************************************
