Hi Philippe,

Thanks for your private mail, in which I already gave you a brief assessment of what might be necessary to include the CJK tokenizers in BaseX:
The existing Apache code can be adapted and embedded into the BaseX tokenizer infrastructure. At the code level, an additional class needs to be implemented that extends the abstract Tokenizer class [1]. As far as I can judge, the 3 Lucene CJK analyzers could all be applied to traditional and simplified Chinese. If we found someone who could rate the linguistic quality of our full-text search results, that'd surely be helpful.

Hope this helps,
Christian

[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/ft/Tokenizer.java

On Tue, Oct 13, 2020 at 12:32 PM Philippe Pons <philippe.p...@college-de-france.fr> wrote:
>
> Dear Christian,
>
> Thank you very much for this quick and enlightening response.
>
> Without having had (yet) the opportunity to test it, I have indeed read about the Japanese text tokenizer.
> Supporting Chinese tokenization would also be a great help.
>
> I have never tested what Lucene offers, especially since I have to manage texts in traditional Chinese and simplified Chinese (without reading either one myself).
> I would like to test Lucene's analyzers, but I don't know how to do that in BaseX.
>
> Best regards,
> Philippe Pons
>
>
> On 12/10/2020 at 12:01, Christian Grün wrote:
>
> Dear Philippe,
>
> As the Chinese language rarely uses inflection, there is usually no
> need to perform stemming on texts. However, tokenization will indeed
> be necessary. Right now, BaseX provides no tokenizer/analyzer for
> Chinese texts. It should indeed be possible to adopt code from Lucene,
> as we've already done for other languages (our software licenses allow
> that).
>
> Have you already worked with tokenization of Chinese texts in Lucene?
> If yes, which of the 3 available analyzers [1] has proven to yield
> the best results?
>
> As you may know, one of our users, Toshio HIRAI, has contributed a
> tokenizer for Japanese texts in the past [2].
> If we decide to include support for Chinese tokenization, it might
> also be interesting to compare the results of the Apache tokenizer
> with our internal tokenizer.
>
> Best regards,
> Christian
>
> [1] https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
> [2] https://docs.basex.org/wiki/Full-Text:_Japanese
>
>
> On Mon, Oct 12, 2020 at 11:37 AM Philippe Pons
> <philippe.p...@college-de-france.fr> wrote:
>
> Dear BaseX Team,
>
> I'm currently working on Chinese texts in TEI.
> I would like to know whether stemming Chinese text is possible in BaseX, as we can
> do with other languages (like English or German)?
> Or maybe there is a way to add this functionality with Lucene?
>
> Best regards,
> Philippe Pons
>
> --
> Research engineer in charge of digital corpus editions
> Centre de recherche sur les civilisations de l'Asie Orientale
> CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University,
> Univ Paris Diderot, Sorbonne Paris Cité)
> 49bis avenue de la Belle Gabrielle
> 75012 Paris
> https://cv.archives-ouvertes.fr/ppons
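
P.S. For anyone following the thread: one of the Lucene CJK approaches mentioned above is bigram tokenization, i.e. indexing overlapping pairs of adjacent CJK characters (this is what Lucene's CJKBigramFilter does). A minimal, dependency-free sketch of that strategy is below; the class and method names are illustrative only and are not part of the BaseX or Lucene APIs — a real integration would instead extend the abstract Tokenizer class linked in [1]:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of CJK bigram tokenization, the strategy behind
// Lucene's CJKBigramFilter: a run of CJK characters is split into
// overlapping two-character tokens. Class/method names are hypothetical.
public final class CjkBigrams {
  /** Returns overlapping character bigrams for a run of CJK text. */
  public static List<String> bigrams(final String text) {
    final List<String> tokens = new ArrayList<>();
    // A single character yields a single unigram token.
    if(text.codePointCount(0, text.length()) == 1) {
      tokens.add(text);
      return tokens;
    }
    // Walk code points (not chars) so supplementary characters are safe.
    int i = 0;
    while(i < text.length()) {
      final int next = text.offsetByCodePoints(i, 1);
      if(next >= text.length()) break;
      final int end = text.offsetByCodePoints(next, 1);
      tokens.add(text.substring(i, end));
      i = next;
    }
    return tokens;
  }

  public static void main(final String[] args) {
    // Prints: [中华, 华人, 人民, 民共, 共和, 和国]
    System.out.println(bigrams("中华人民共和国"));
  }
}
```

Bigram tokenization needs no dictionary and treats traditional and simplified Chinese alike, at the cost of some false matches; by contrast, Lucene's SmartChineseAnalyzer performs statistical word segmentation (simplified Chinese only), which is one reason comparing the three analyzers on real corpora, as suggested above, is worthwhile.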