Dear Christian,
I’d be happy to chime in on the quality of BaseX’s Chinese-language full-text capabilities. Chinese sources are my primary research area. What exactly do you have in mind?

Greetings,
Duncan

Ceterum censeo exist-db.org esse corrigendam

> Today's Topics:
>
>    1. Re: stemming chinese texts (Philippe Pons)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 14 Oct 2020 12:30:59 +0200
> From: Philippe Pons <philippe.p...@college-de-france.fr>
> To: basex-talk@mailman.uni-konstanz.de
> Subject: Re: [basex-talk] stemming chinese texts
> Message-ID: <d40e4b6e-29ab-f62f-1617-505db18e9...@college-de-france.fr>
> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
>
> Hi Christian,
>
> I suppose some of my colleagues would be able to judge the quality of
> your full-text search results.
>
> On the other hand, at the code level, I'm not sure I know how to
> implement an additional class that extends the abstract Tokenizer class.
>
> Thank you for your help,
> Philippe
>
> On 14/10/2020 at 11:00, Christian Grün wrote:
>> Hi Philippe,
>>
>> Thanks for your mail in private, in which I already gave you a little
>> assessment of what might be necessary to include the CJK tokenizers in
>> BaseX:
>>
>> The existing Apache code can be adapted and embedded into the BaseX
>> tokenizer infrastructure. At the code level, an additional class needs
>> to be implemented that extends the abstract Tokenizer class [1].
>>
>> As far as I can judge, the three Lucene CJK analyzers could all be
>> applied to traditional and simplified Chinese. If we found someone who
>> could rate the linguistic quality of our full-text search results,
>> that'd surely be helpful.
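The subclassing pattern described above ("an additional class that extends the abstract Tokenizer class") can be sketched in a self-contained way. Note that the abstract class below is a hypothetical, much-simplified stand-in, not BaseX's actual org.basex.util.ft.Tokenizer (whose real API is richer); it only illustrates the shape of plugging a Chinese tokenizer into such an infrastructure:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for an abstract tokenizer base class;
// BaseX's real org.basex.util.ft.Tokenizer has a different, richer API.
abstract class AbstractTokenizer implements Iterator<String> {
  /** Supplies the text to tokenize and resets the token stream. */
  abstract void init(String text);
}

/** Emits one token per CJK ideograph (unigram segmentation). */
class ChineseUnigramTokenizer extends AbstractTokenizer {
  private final List<String> tokens = new ArrayList<>();
  private int pos;

  @Override
  void init(String text) {
    tokens.clear();
    pos = 0;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      // Keep only Chinese ideographs; a production tokenizer would also
      // handle Latin runs, punctuation, and supplementary-plane characters.
      if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
        tokens.add(String.valueOf(c));
      }
    }
  }

  @Override public boolean hasNext() { return pos < tokens.size(); }
  @Override public String next() { return tokens.get(pos++); }
}

public class TokenizerSketch {
  public static void main(String[] args) {
    ChineseUnigramTokenizer t = new ChineseUnigramTokenizer();
    t.init("全文搜索"); // "full-text search"
    while (t.hasNext()) System.out.println(t.next()); // 全, 文, 搜, 索
  }
}
```

Adapting the Lucene code would essentially mean replacing the loop body above with the Apache segmentation logic while keeping the BaseX-side interface contract.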
>> Hope this helps,
>> Christian
>>
>> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/ft/Tokenizer.java
>>
>> On Tue, Oct 13, 2020 at 12:32 PM Philippe Pons
>> <philippe.p...@college-de-france.fr> wrote:
>>> Dear Christian,
>>>
>>> Thank you very much for this quick and enlightening response.
>>>
>>> Without having had the opportunity to test it yet, I have indeed read
>>> about the Japanese text tokenizer.
>>> Supporting Chinese tokenization would also be a great help.
>>>
>>> I have never tested what Lucene offers, especially since I have to
>>> manage texts in both traditional and simplified Chinese (without
>>> reading either one myself).
>>> I would like to test Lucene's analyzers, but I don't know how to do
>>> that in BaseX.
>>>
>>> Best regards,
>>> Philippe Pons
>>>
>>> On 12/10/2020 at 12:01, Christian Grün wrote:
>>>
>>> Dear Philippe,
>>>
>>> As the Chinese language rarely uses inflection, there is usually no
>>> need to perform stemming on texts. However, tokenization will indeed
>>> be necessary. Right now, BaseX provides no tokenizer/analyzer for
>>> Chinese texts. It should indeed be possible to adopt code from Lucene,
>>> as we've already done for other languages (our software licenses allow
>>> that).
>>>
>>> Have you already worked with tokenization of Chinese texts in Lucene?
>>> If so, which of the three available analyzers [1] has proven to yield
>>> the best results?
>>>
>>> As you may know, one of our users, Toshio HIRAI, has contributed a
>>> tokenizer for Japanese texts in the past [2]. If we decide to include
>>> support for Chinese tokenization, it might also be interesting to
>>> compare the results of the Apache tokenizer with our internal
>>> tokenizer.
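For such a comparison it helps to know that the Lucene analyzers commonly used for Chinese differ mainly in segmentation strategy: roughly, StandardAnalyzer indexes single ideographs (unigrams), CJKAnalyzer indexes overlapping two-character bigrams, and SmartChineseAnalyzer performs dictionary-based word segmentation (assuming these are the three analyzers meant here). A minimal, dependency-free sketch of the first two strategies, useful for eyeballing what each produces on the same input (the dictionary-based approach needs its bundled lexicon and is not reproduced):

```java
import java.util.ArrayList;
import java.util.List;

public class CjkSegmentation {
  /** Unigram segmentation: one token per character (StandardAnalyzer-style CJK handling). */
  static List<String> unigrams(String text) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) out.add(text.substring(i, i + 1));
    return out;
  }

  /** Overlapping bigram segmentation (CJKAnalyzer-style). */
  static List<String> bigrams(String text) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i + 1 < text.length(); i++) out.add(text.substring(i, i + 2));
    return out;
  }

  public static void main(String[] args) {
    String s = "全文搜索"; // "full-text search"
    System.out.println(unigrams(s)); // [全, 文, 搜, 索]
    System.out.println(bigrams(s));  // [全文, 文搜, 搜索]
  }
}
```

Bigrams generally give better precision than unigrams at the cost of a larger index; which trade-off suits traditional versus simplified texts is exactly the kind of question a native-reading evaluator could settle.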
>>> Best regards,
>>> Christian
>>>
>>> [1] https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
>>> [2] https://docs.basex.org/wiki/Full-Text:_Japanese
>>>
>>> On Mon, Oct 12, 2020 at 11:37 AM Philippe Pons
>>> <philippe.p...@college-de-france.fr> wrote:
>>>
>>> Dear BaseX Team,
>>>
>>> I'm currently working on Chinese texts in TEI.
>>> I would like to know if stemming Chinese text is possible in BaseX, as
>>> can be done with other languages (like English or German)?
>>> Or maybe there is a way to add this functionality with Lucene?
>>>
>>> Best regards,
>>> Philippe Pons
>>>
>>> --
>>> Research engineer in charge of digital corpus editions
>>> Centre de recherche sur les civilisations de l'Asie Orientale
>>> CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University,
>>> Univ Paris Diderot, Sorbonne Paris Cité)
>>> 49bis avenue de la Belle Gabrielle
>>> 75012 Paris
>>> https://cv.archives-ouvertes.fr/ppons
>
> End of BaseX-Talk Digest, Vol 130, Issue 8
> ******************************************