Dear All,

Each of the Lucene analysers has its use case, and I have used them all. Maybe we can arrange a short video call to discuss what you want to achieve, and where the pitfalls might be.
Generally, I would advise not to think about the Japanese analyser all that much; Chinese is very different, so many of its features simply don't apply.

Greetings
Duncan

P.S.: I'm located in Germany, and generally available for a call on Wednesdays.

Sent from my iPad

> On 18. Oct 2020, at 16:01, Christian Grün <christian.gr...@gmail.com> wrote:
>
> Hi Duncan,
>
> Thanks for offering your help, that's appreciated.
>
> We could add Lucene's CJK analyzers to BaseX, and either embed it or provide it as a library, similar to the Japanese tokenizer. Have you already used the Lucene analyzers [1], and if so, which of the 3 provided analyzers would you recommend?
>
> Or have you even realized full-text search with BaseX and Chinese texts?
>
> Cheers,
> Christian
>
> [1] https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
>
>> On Thu, Oct 15, 2020 at 1:01 PM Duncan Paterson <dun...@exist-db.org> wrote:
>>
>> Dear Christian,
>>
>> I'd be happy to chime in on the quality of BaseX's Chinese-language full-text capabilities. Chinese sources are my primary research area. What exactly do you have in mind?
>>
>> Greetings
>> Duncan
>>
>> Ceterum censeo exist-db.org esse conriganda
>>
>>
>> Today's Topics:
>>
>>    1. Re: stemming chinese texts (Philippe Pons)
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Wed, 14 Oct 2020 12:30:59 +0200
>> From: Philippe Pons <philippe.p...@college-de-france.fr>
>> To: basex-talk@mailman.uni-konstanz.de
>> Subject: Re: [basex-talk] stemming chinese texts
>>
>> Hi Christian,
>>
>> I suppose some of my colleagues would be able to judge the quality of your full-text search results.
>>
>> On the other hand, on the code level, I'm not sure I know how to implement an additional class that extends the abstract Tokenizer class.
>>
>> Thank you for your help
>> Philippe
>>
>>
>> On 14/10/2020 at 11:00, Christian Grün wrote:
>>
>> Hi Philippe,
>>
>> Thanks for your mail in private, in which I already gave you a little assessment of what might be necessary to include the CJK tokenizers in BaseX:
>>
>> The existing Apache code can be adapted and embedded into the BaseX tokenizer infrastructure. On the code level, an additional class needs to be implemented that extends the abstract Tokenizer class [1].
>>
>> As far as I can judge, the 3 Lucene CJK analyzers could all be applied to traditional and simplified Chinese. If we found someone who could rate the linguistic quality of our full-text search results, that'd surely be helpful.
>>
>> Hope this helps,
>> Christian
>>
>> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/ft/Tokenizer.java
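For illustration only, here is a rough, untested sketch of the overlapping-bigram segmentation that Lucene's CJK approach applies to runs of Han characters; this is essentially the output an adapted Chinese tokenizer for BaseX would have to produce. It deliberately does not use the actual API of org.basex.util.ft.Tokenizer (whose abstract methods are not reproduced here), so the class and method names below are made up for this sketch.

import java.util.ArrayList;
import java.util.List;

/** Sketch of overlapping-bigram segmentation for Han characters (Lucene CJK style). */
public class HanBigramSketch {

  /** Returns Han bigrams; runs of non-Han letters/digits are returned as single tokens. */
  static List<String> tokens(final String text) {
    final List<String> result = new ArrayList<>();
    final StringBuilder latin = new StringBuilder();
    int prev = -1;                                   // previous Han code point, or -1
    for(int i = 0; i < text.length();) {
      final int cp = text.codePointAt(i);
      if(Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
        flush(latin, result);
        // emit a bigram of the previous and current Han character;
        // note: a single isolated Han character yields no token in this simplified sketch
        if(prev != -1) result.add(new String(Character.toChars(prev)) + new String(Character.toChars(cp)));
        prev = cp;
      } else {
        prev = -1;
        if(Character.isLetterOrDigit(cp)) latin.appendCodePoint(cp);
        else flush(latin, result);                   // whitespace/punctuation ends a non-Han token
      }
      i += Character.charCount(cp);
    }
    flush(latin, result);
    return result;
  }

  private static void flush(final StringBuilder sb, final List<String> result) {
    if(sb.length() > 0) { result.add(sb.toString()); sb.setLength(0); }
  }

  public static void main(final String[] args) {
    // prints: [我在, 在学, 学习, 习中, 中文, with, BaseX]
    System.out.println(tokens("我在学习中文 with BaseX"));
  }
}

A real integration would wrap this logic (or Lucene's CJKBigramFilter directly) in a subclass of the BaseX Tokenizer, analogous to the existing Japanese tokenizer contribution.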
>> On Tue, Oct 13, 2020 at 12:32 PM Philippe Pons <philippe.p...@college-de-france.fr> wrote:
>>
>> Dear Christian,
>>
>> Thank you very much for this quick and enlightening response.
>>
>> Without having had (yet) the opportunity to test it, I have indeed read about the Japanese text tokenizer. Supporting Chinese tokenization would also be a great help.
>>
>> I have never tested what Lucene offers, especially since I have to manage texts in traditional Chinese and simplified Chinese (without reading either one myself). I would like to test Lucene's analyzers, but I don't know how to do that in BaseX.
>>
>> Best regards,
>> Philippe Pons
>>
>>
>> On 12/10/2020 at 12:01, Christian Grün wrote:
>>
>> Dear Philippe,
>>
>> As the Chinese language rarely uses inflection, there is usually no need to perform stemming on texts. However, tokenization will indeed be necessary. Right now, BaseX provides no tokenizer/analyzer for Chinese texts. It should indeed be possible to adopt code from Lucene, as we've already done for other languages (our software licenses allow that).
>>
>> Have you already worked with tokenization of Chinese texts in Lucene? If so, which of the 3 available analyzers [1] have proven to yield the best results?
>>
>> As you may know, one of our users, Toshio HIRAI, has contributed a tokenizer for Japanese texts in the past [2]. If we decide to include support for Chinese tokenization, it might also be interesting to compare the results of the Apache tokenizer with our internal tokenizer.
>>
>> Kind regards,
>> Christian
>>
>> [1] https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
>> [2] https://docs.basex.org/wiki/Full-Text:_Japanese
>>
>>
>> On Mon, Oct 12, 2020 at 11:37 AM Philippe Pons <philippe.p...@college-de-france.fr> wrote:
>>
>> Dear BaseX Team,
>>
>> I'm currently working on Chinese texts in TEI. I would like to know whether stemming Chinese text is possible in BaseX, as we can do with other languages (like English or German)? Or maybe there is a way to add this functionality with Lucene?
>>
>> Best regards,
>> Philippe Pons
>>
>> --
>> Research engineer in charge of digital corpus editions
>> Centre de recherche sur les civilisations de l'Asie Orientale
>> CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University, Univ Paris Diderot, Sorbonne Paris Cité)
>> 49bis avenue de la Belle Gabrielle
>> 75012 Paris
>> https://cv.archives-ouvertes.fr/ppons
>>
>> End of BaseX-Talk Digest, Vol 130, Issue 8
>> ******************************************
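As a starting point for Philippe's question about trying Lucene's analyzers, they can be tested independently of BaseX first. The following small, self-contained sketch prints the tokens that CJKAnalyzer produces for a Chinese sentence; it assumes the lucene-analyzers-common jar (which contains org.apache.lucene.analysis.cjk.CJKAnalyzer) is on the classpath.

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CjkTokenDemo {
  public static void main(final String[] args) throws IOException {
    // Sample sentence (simplified Chinese): "I am learning Chinese."
    final String text = "我在学习中文。";
    try(Analyzer analyzer = new CJKAnalyzer()) {
      // The field name is irrelevant here; we only inspect the token stream.
      try(TokenStream stream = analyzer.tokenStream("content", text)) {
        final CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while(stream.incrementToken()) {
          // expected: overlapping bigrams such as 我在, 在学, 学习, 习中, 中文
          System.out.println(term.toString());
        }
        stream.end();
      }
    }
  }
}

CJKAnalyzer emits overlapping bigrams for runs of CJK characters, which is usually a workable baseline for both traditional and simplified Chinese; SmartChineseAnalyzer, in the separate lucene-analyzers-smartcn module, is a dictionary/statistics-based alternative aimed at simplified Chinese and may be worth comparing.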