Dear Philippe,

As the Chinese language rarely uses inflection, there is usually no
need to perform stemming on its texts. Tokenization, however, will
indeed be necessary. Right now, BaseX provides no tokenizer/analyzer
for Chinese texts. It should be possible to adopt code from Lucene,
as we've already done for other languages (our software licenses
allow that).

Have you already worked with the tokenization of Chinese texts in
Lucene? If so, which of the three available analyzers [1] has proven
to yield the best results?
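
For quick experiments, a minimal Java sketch along these lines can
print the tokens an analyzer emits (untested, and it assumes the
Lucene analyzers-common jar on the classpath; CJKAnalyzer is the
bigram-based variant mentioned in [1]):

  import java.io.IOException;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.cjk.CJKAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class ChineseTokens {
    public static void main(String[] args) throws IOException {
      // CJKAnalyzer splits runs of Han characters into overlapping bigrams.
      try (Analyzer analyzer = new CJKAnalyzer();
           TokenStream stream = analyzer.tokenStream("text", "我是中国人")) {
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
          System.out.println(term.toString());
        }
        stream.end();
      }
    }
  }

Swapping in SmartChineseAnalyzer (from the separate analyzers-smartcn
module) should yield actual word segmentation for the same input
instead of bigrams.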

As you may know, one of our users, Toshio HIRAI, contributed a
tokenizer for Japanese texts in the past [2]. If we decide to include
support for Chinese tokenization, it might also be interesting to
compare the results of the Apache tokenizer with those of our
internal tokenizer.
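
Such a comparison could even be scripted against BaseX itself. Here
is a rough, untested sketch that calls our ft:tokenize function from
Java (it assumes the language-specific tokenizer and any dictionaries
it needs are available):

  import org.basex.core.Context;
  import org.basex.query.QueryProcessor;
  import org.basex.query.value.item.Item;

  public class InternalTokens {
    public static void main(String[] args) throws Exception {
      Context ctx = new Context();
      // ft:tokenize runs a string through BaseX's internal full-text
      // tokenizer; the 'language' option picks the language-specific path.
      String query = "ft:tokenize('日本語のテキスト', map { 'language': 'ja' })";
      try (QueryProcessor qp = new QueryProcessor(query, ctx)) {
        for (Item token : qp.value()) {
          System.out.println(token.toJava());
        }
      }
      ctx.close();
    }
  }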

Best regards,
Christian

[1] 
https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
[2] https://docs.basex.org/wiki/Full-Text:_Japanese



On Mon, Oct 12, 2020 at 11:37 AM Philippe Pons
<philippe.p...@college-de-france.fr> wrote:
>
> Dear BaseX Team,
>
> I am currently working on Chinese texts in TEI.
> I would like to know if stemming Chinese text is possible in BaseX, as we can
> do with other languages (like English or German).
> Or maybe there is a way to add this functionality with Lucene?
>
> Best regards,
> Philippe Pons
>
> --
> Research engineer in charge of digital corpus editions
> Centre de recherche sur les civilisations de l'Asie Orientale
> CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University, 
> Univ Paris Diderot, Sorbonne Paris Cité)
> 49bis avenue de la Belle Gabrielle
> 75012 Paris
> https://cv.archives-ouvertes.fr/ppons
