Hi Philippe,

Thanks for your private mail, in reply to which I already gave you a
brief assessment of what would be necessary to include the CJK
tokenizers in BaseX:

The existing Apache code can be adapted and embedded into the BaseX
tokenizer infrastructure. At the code level, an additional class needs
to be implemented that extends the abstract Tokenizer class [1].
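
To sketch the idea, here is a minimal, self-contained example. Note
that the local Tokenizer stub below merely stands in for
org.basex.util.ft.Tokenizer [1]; all class and method names are
assumptions for illustration, not the real BaseX API:

  import java.util.Iterator;

  // Hypothetical stand-in for the abstract BaseX Tokenizer class [1].
  abstract class Tokenizer implements Iterator<String> { }

  // A character-bigram tokenizer, the strategy Lucene's CJK analyzers
  // use (surrogate pairs are ignored for brevity).
  final class ChineseTokenizer extends Tokenizer {
    private final String text;
    private int pos;

    ChineseTokenizer(final String text) { this.text = text; }

    @Override
    public boolean hasNext() {
      // a bigram needs two characters from the current position
      return pos + 1 < text.length();
    }

    @Override
    public String next() {
      // emit the overlapping two-character window, then advance by one
      final String bigram = text.substring(pos, pos + 2);
      pos++;
      return bigram;
    }
  }

For example, new ChineseTokenizer("我愛北京").forEachRemaining(System.out::println)
prints 我愛, 愛北 and 北京.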

As far as I can judge, the three Lucene CJK analyzers could all be
applied to traditional and simplified Chinese. If we found someone who
could rate the linguistic quality of our full-text search results,
that’d surely be helpful.
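
Since you asked how to test Lucene's analyzers: they can be run
standalone, outside BaseX. A minimal sketch, assuming the
lucene-analyzers-common jar is on the classpath (the field name "text"
is arbitrary):

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.cjk.CJKAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class CJKTest {
    public static void main(String[] args) throws Exception {
      // CJKAnalyzer splits runs of CJK characters into overlapping bigrams
      try (CJKAnalyzer analyzer = new CJKAnalyzer();
           TokenStream ts = analyzer.tokenStream("text", "我愛北京天安門")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) System.out.println(term);
        ts.end();
      }
    }
  }

The word-based SmartChineseAnalyzer ships in the separate
lucene-analyzers-smartcn artifact; swapping it in lets you compare its
segmentation with the bigram output above.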

Hope this helps,
Christian

[1] 
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/ft/Tokenizer.java



On Tue, Oct 13, 2020 at 12:32 PM Philippe Pons
<philippe.p...@college-de-france.fr> wrote:
>
> Dear Christian,
>
> Thank you very much for this quick and enlightening response.
>
> Without having had the opportunity to test it yet, I have indeed read about
> the Japanese text tokenizer.
> Supporting Chinese tokenization would also be a great help.
>
> I have never tested what Lucene offers, especially since I have to manage
> texts in traditional Chinese and simplified Chinese (without being able to
> read either one myself).
> I would like to test Lucene's analyzers, but I don't know how to do that in
> BaseX.
>
> Best regards,
> Philippe Pons
>
>
>
> On 12/10/2020 at 12:01, Christian Grün wrote:
>
> Dear Philippe,
>
> As the Chinese language rarely uses inflection, there is usually no
> need to perform stemming on texts. However, tokenization will indeed
> be necessary. Right now, BaseX provides no tokenizer/analyzer for
> Chinese texts. It should indeed be possible to adopt code from Lucene,
> as we’ve already done for other languages (our software licenses allow
> that).
>
> Have you already worked with tokenization of Chinese texts in Lucene?
> If so, which of the three available analyzers [1] has proven to yield
> the best results?
>
> As you may know, one of our users, Toshio HIRAI, has contributed a
> tokenizer for Japanese texts in the past [2]. If we decide to include
> support for Chinese tokenization, it might also be interesting to
> compare the results of the Apache tokenizer with our internal
> tokenizer.
>
> Best regards,
> Christian
>
> [1] 
> https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
> [2] https://docs.basex.org/wiki/Full-Text:_Japanese
>
>
>
> On Mon, Oct 12, 2020 at 11:37 AM Philippe Pons
> <philippe.p...@college-de-france.fr> wrote:
>
> Dear BaseX Team,
>
> I'm currently working on Chinese texts in TEI.
> I would like to know if stemming Chinese text is possible in BaseX, as we
> can do with other languages (like English or German).
> Or maybe there is a way to add this functionality with Lucene?
>
> Best regards,
> Philippe Pons
>
> --
> Research engineer in charge of digital corpus editions
> Centre de recherche sur les civilisations de l'Asie Orientale
> CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University, 
> Univ Paris Diderot, Sorbonne Paris Cité)
> 49bis avenue de la Belle Gabrielle
> 75012 Paris
> https://cv.archives-ouvertes.fr/ppons
>
>
