Dear All,

Each of the Lucene analysers has its use case, and I have used them all. Maybe we can arrange a short video call to discuss what you want to achieve, and where the pitfalls might be.
Generally, I would advise not to think about the Japanese analyser all that much; Chinese is very different, so many of its features simply don't apply.

Greetings
Duncan

P.S.: I'm located in Germany, and generally available for a call on Wednesdays.

Sent from my iPad

> On 18. Oct 2020, at 16:01, Christian Grün <christian.gr...@gmail.com> wrote:
>
> Hi Duncan,
>
> Thanks for offering your help, that's appreciated.
>
> We could add Lucene's CJK analyzers to BaseX, and either embed it or provide it as a library, similar to the Japanese tokenizer. Have you already used the Lucene analyzers [1], and if so, which of the 3 provided analyzers would you recommend?
>
> Or have you even realized full-text search with BaseX and Chinese texts?
>
> Cheers,
> Christian
>
> [1] https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
>
>> On Thu, Oct 15, 2020 at 1:01 PM Duncan Paterson <dun...@exist-db.org> wrote:
>>
>> Dear Christian,
>>
>> I'd be happy to chime in on the quality of BaseX's Chinese-language full-text capabilities. Chinese sources are my primary research area. What exactly do you have in mind?
>>
>> Greetings
>> Duncan
>>
>> Ceterum censeo exist-db.org esse conriganda
>>
>>
>> Today's Topics:
>>
>>    1. Re: stemming chinese texts (Philippe Pons)
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Wed, 14 Oct 2020 12:30:59 +0200
>> From: Philippe Pons <philippe.p...@college-de-france.fr>
>> To: basex-talk@mailman.uni-konstanz.de
>> Subject: Re: [basex-talk] stemming chinese texts
>>
>> Hi Christian,
>>
>> I suppose some of my colleagues would be able to judge the quality of your full-text search results.
>>
>> On the other hand, on the code level, I'm not sure I know how to implement an additional class that extends the abstract Tokenizer class.
>>
>> Thank you for your help
>> Philippe
>>
>>
>> On 14/10/2020 at 11:00, Christian Grün wrote:
>>
>> Hi Philippe,
>>
>> Thanks for your mail in private, in which I already gave you a little assessment of what might be necessary to include the CJK tokenizers in BaseX:
>>
>> The existing Apache code can be adapted and embedded into the BaseX tokenizer infrastructure. On the code level, an additional class needs to be implemented that extends the abstract Tokenizer class [1].
>>
>> As far as I can judge, the 3 Lucene CJK analyzers could all be applied to traditional and simplified Chinese. If we found someone who could rate the linguistic quality of our full-text search results, that'd surely be helpful.
>>
>> Hope this helps,
>> Christian
>>
>> [1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/ft/Tokenizer.java
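For illustration only, here is a rough, untested sketch of the overlapping-bigram segmentation that Lucene's CJK approach applies to runs of Han characters; this is essentially the output an adapted Chinese tokenizer for BaseX would have to produce. It deliberately does not use the actual API of org.basex.util.ft.Tokenizer (whose abstract methods are not reproduced here), so the class and method names below are made up for this sketch.

import java.util.ArrayList;
import java.util.List;

/** Sketch of overlapping-bigram segmentation for Han characters (Lucene CJK style). */
public class HanBigramSketch {

  /** Returns Han bigrams; runs of non-Han letters/digits are returned as single tokens. */
  static List<String> tokens(final String text) {
    final List<String> result = new ArrayList<>();
    final StringBuilder latin = new StringBuilder();
    int prev = -1;                                   // previous Han code point, or -1
    for(int i = 0; i < text.length();) {
      final int cp = text.codePointAt(i);
      if(Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
        flush(latin, result);
        // emit a bigram of the previous and current Han character;
        // note: a single isolated Han character yields no token in this simplified sketch
        if(prev != -1) result.add(new String(Character.toChars(prev)) + new String(Character.toChars(cp)));
        prev = cp;
      } else {
        prev = -1;
        if(Character.isLetterOrDigit(cp)) latin.appendCodePoint(cp);
        else flush(latin, result);                   // whitespace/punctuation ends a non-Han token
      }
      i += Character.charCount(cp);
    }
    flush(latin, result);
    return result;
  }

  private static void flush(final StringBuilder sb, final List<String> result) {
    if(sb.length() > 0) { result.add(sb.toString()); sb.setLength(0); }
  }

  public static void main(final String[] args) {
    // prints: [我在, 在学, 学习, 习中, 中文, with, BaseX]
    System.out.println(tokens("我在学习中文 with BaseX"));
  }
}

A real integration would wrap this logic (or Lucene's CJKBigramFilter directly) in a subclass of the BaseX Tokenizer, analogous to the existing Japanese tokenizer contribution.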
>> On Tue, Oct 13, 2020 at 12:32 PM Philippe Pons <philippe.p...@college-de-france.fr> wrote:
>>
>> Dear Christian,
>>
>> Thank you very much for this quick and enlightening response.
>>
>> Without having had (yet) the opportunity to test it, I have indeed read about the Japanese text tokenizer. Supporting Chinese tokenization would also be a great help.
>>
>> I have never tested what Lucene offers, especially since I have to manage texts in traditional Chinese and simplified Chinese (without reading either one myself). I would like to test Lucene's analyzers, but I don't know how to do that in BaseX.
>>
>> Best regards,
>> Philippe Pons
>>
>>
>> On 12/10/2020 at 12:01, Christian Grün wrote:
>>
>> Dear Philippe,
>>
>> As the Chinese language rarely uses inflection, there is usually no need to perform stemming on texts. However, tokenization will indeed be necessary. Right now, BaseX provides no tokenizer/analyzer for Chinese texts. It should indeed be possible to adopt code from Lucene, as we've already done for other languages (our software licenses allow that).
>>
>> Have you already worked with tokenization of Chinese texts in Lucene? If so, which of the 3 available analyzers [1] have proven to yield the best results?
>>
>> As you may know, one of our users, Toshio HIRAI, has contributed a tokenizer for Japanese texts in the past [2]. If we decide to include support for Chinese tokenization, it might also be interesting to compare the results of the Apache tokenizer with our internal tokenizer.
>>
>> Kind regards,
>> Christian
>>
>> [1] https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
>> [2] https://docs.basex.org/wiki/Full-Text:_Japanese
>>
>>
>> On Mon, Oct 12, 2020 at 11:37 AM Philippe Pons <philippe.p...@college-de-france.fr> wrote:
>>
>> Dear BaseX Team,
>>
>> I'm currently working on Chinese texts in TEI. I would like to know whether stemming Chinese text is possible in BaseX, as we can do with other languages (like English or German)? Or maybe there is a way to add this functionality with Lucene?
>>
>> Best regards,
>> Philippe Pons
>>
>> --
>> Research engineer in charge of digital corpus editions
>> Centre de recherche sur les civilisations de l'Asie Orientale
>> CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University, Univ Paris Diderot, Sorbonne Paris Cité)
>> 49bis avenue de la Belle Gabrielle
>> 75012 Paris
>> https://cv.archives-ouvertes.fr/ppons
>>
>> End of BaseX-Talk Digest, Vol 130, Issue 8
>> ******************************************
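As a starting point for Philippe's question about trying Lucene's analyzers, they can be tested independently of BaseX first. The following small, self-contained sketch prints the tokens that CJKAnalyzer produces for a Chinese sentence; it assumes the lucene-analyzers-common jar (which contains org.apache.lucene.analysis.cjk.CJKAnalyzer) is on the classpath.

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CjkTokenDemo {
  public static void main(final String[] args) throws IOException {
    // Sample sentence (simplified Chinese): "I am learning Chinese."
    final String text = "我在学习中文。";
    try(Analyzer analyzer = new CJKAnalyzer()) {
      // The field name is irrelevant here; we only inspect the token stream.
      try(TokenStream stream = analyzer.tokenStream("content", text)) {
        final CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while(stream.incrementToken()) {
          // expected: overlapping bigrams such as 我在, 在学, 学习, 习中, 中文
          System.out.println(term.toString());
        }
        stream.end();
      }
    }
  }
}

CJKAnalyzer emits overlapping bigrams for runs of CJK characters, which is usually a workable baseline for both traditional and simplified Chinese; SmartChineseAnalyzer, in the separate lucene-analyzers-smartcn module, is a dictionary/statistics-based alternative aimed at simplified Chinese and may be worth comparing.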