Dear All, 

Each of the Lucene analysers has its use case, and I have used them all. 
Maybe we can arrange a short video call to discuss what you want to achieve, 
and where the pitfalls might be. 

Generally, I would advise not thinking about the Japanese analyser all that 
much. Chinese is very different, so many of its features simply don’t apply. 


P.S.: I am located in Germany, and generally available for a call on Wednesdays. 


> On 18. Oct 2020, at 16:01, Christian Grün <> wrote:
> Hi Duncan,
> Thanks for offering your help, that’s appreciated.
> We could add Lucene’s CJK analyzers to BaseX, and either embed them or
> provide them as a library, similar to the Japanese tokenizer. Have you
> already used the Lucene analyzers [1], and if so, which of the 3
> provided analyzers would you recommend?
> Or have you even realized full-text search with BaseX and Chinese texts?
> Cheers,
> Christian
> [1] 
>> On Thu, Oct 15, 2020 at 1:01 PM Duncan Paterson <> wrote:
>> Dear Christian,
>> I’d be happy to chime in on the quality of BaseX’s Chinese language full-text 
>> capabilities. Chinese sources are my primary research area. What exactly do 
>> you have in mind?
>> Greetings
>> Duncan
>> Ceterum censeo esse conriganda
>> Date: Wed, 14 Oct 2020 12:30:59 +0200
>> From: Philippe Pons <>
>> Subject: Re: [basex-talk] stemming chinese texts
>> Hi Christian,
>> I suppose some of my colleagues would be able to judge the quality of
>> your full-text search results.
>> On the other hand, at the code level, I'm not sure I know how to implement
>> an additional class that extends the abstract Tokenizer class.
>> Thank you for your help
>> Philippe
>> On 14/10/2020 at 11:00, Christian Grün wrote:
>> Hi Philippe,
>> Thanks for your mail in private, in which I already gave you a little
>> assessment on what might be necessary to include the CJK tokenizers in
>> BaseX:
>> The existing Apache code can be adapted and embedded into the BaseX
>> tokenizer infrastructure. At the code level, an additional class needs to
>> be implemented that extends the abstract Tokenizer class [1].
>> As far as I can judge, the 3 Lucene CJK analyzers could all be applied
>> to traditional and simplified Chinese. If we found someone who could
>> rate the linguistic quality of our full-text search results, that’d
>> surely be helpful.
>> Hope this helps,
>> Christian
>> [1] 
>> On Tue, Oct 13, 2020 at 12:32 PM Philippe Pons
>> <> wrote:
>> Dear Christian,
>> Thank you very much for this quick and enlightening response.
>> Without having had the opportunity to test it yet, I have indeed read about the 
>> Japanese text tokenizer.
>> Supporting Chinese tokenization would also be a great help.
>> I have never tested what Lucene offers, especially since I have to manage 
>> texts in traditional Chinese and simplified Chinese (without reading either 
>> one myself).
>> I would like to test Lucene's analyzers, but I don't know how to do that in 
>> BaseX.
>> Best regards,
>> Philippe Pons
>> On 12/10/2020 at 12:01, Christian Grün wrote:
>> Dear Philippe,
>> As the Chinese language rarely uses inflection, there is usually no
>> need to perform stemming on texts. However, tokenization will be
>> necessary indeed. Right now, BaseX provides no tokenizer/analyzer for
>> Chinese texts. It should be possible indeed to adopt code from Lucene,
>> as we’ve already done for other languages (our software licenses allow
>> that).
>> Have you already worked with tokenization of Chinese texts in Lucene?
>> If yes, which of the 3 available analyzers [1] has proven to yield
>> the best results?
>> As you may know, one of our users, Toshio HIRAI, has contributed a
>> tokenizer for Japanese texts in the past [2]. If we decide to include
>> support for Chinese tokenization, it might as well be interesting to
>> compare the results of the Apache tokenizer with our internal
>> tokenizer.
>> Kind regards,
>> Christian
>> [1] 
>> [2]
>> On Mon, Oct 12, 2020 at 11:37 AM Philippe Pons
>> <> wrote:
>> Dear BaseX Team,
>> I'm currently working on Chinese texts in TEI.
>> I would like to know if stemming Chinese text is possible in BaseX, as we 
>> can do with other languages (like English or German).
>> Or maybe there is a way to add this functionality with Lucene?
>> Best regards,
>> Philippe Pons
>> --
>> Ingénieur d'étude chargé de l'édition de corpus numériques
>> Centre de recherche sur les civilisations de l'Asie Orientale
>> CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University, 
>> Univ Paris Diderot, Sorbonne Paris Cité)
>> 49bis avenue de la Belle Gabrielle
>> 75012 Paris
>> End of BaseX-Talk Digest, Vol 130, Issue 8
>> ******************************************
