Re: [basex-talk] stemming chinese texts

2020-10-19 Thread Duncan Paterson

Re: [basex-talk] stemming chinese texts

2020-10-18 Thread Christian Grün

[basex-talk] stemming chinese texts

2020-10-15 Thread Duncan Paterson

Re: [basex-talk] stemming chinese texts

2020-10-14 Thread Philippe Pons
Hi Christian, I suppose some of my colleagues would be able to judge the quality of your full-text search results. On the other hand, at the code level, I'm not sure I know how to implement an additional class that extends the abstract Tokenizer class. Thank you for your help. Philippe Le 14/10/2

Re: [basex-talk] stemming chinese texts

2020-10-14 Thread Christian Grün
Hi Philippe, Thanks for your private mail, in which I already gave you a brief assessment of what might be necessary to include the CJK tokenizers in BaseX: the existing Apache code can be adapted and embedded into the BaseX tokenizer infrastructure. At the code level, an additional class needs t
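The subclassing idea described above can be sketched in plain Java. Note that the abstract class below is a hypothetical stand-in: BaseX's real abstract tokenizer (in the `org.basex.util.ft` package) has a richer API; this toy version only illustrates where an additional CJK subclass would slot into the tokenizer infrastructure.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for BaseX's abstract tokenizer class; the real
// API differs. Shown only to illustrate the subclassing shape.
abstract class Tokenizer {
    abstract List<String> tokens(String text);
}

// Whitespace splitting, roughly what suffices for Western languages.
class WesternTokenizer extends Tokenizer {
    @Override
    List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }
}

public class TokenizerSketch {
    public static void main(String[] args) {
        Tokenizer t = new WesternTokenizer();
        System.out.println(t.tokens("full text search"));  // [full, text, search]
    }
}
```

A Chinese tokenizer would then be one more subclass, overriding `tokens(...)` with segmentation logic adapted from the Lucene code mentioned in the thread.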

Re: [basex-talk] stemming chinese texts

2020-10-13 Thread Philippe Pons
Dear Christian, Thank you very much for this quick and enlightening response. Although I have not yet had the opportunity to test it, I have indeed read about the Japanese text tokenizer. Supporting Chinese tokenization would also be a great help. I have never tested what Lucene offers, especially sin

Re: [basex-talk] stemming chinese texts

2020-10-12 Thread Christian Grün
Dear Philippe, As the Chinese language rarely uses inflection, there is usually no need to perform stemming on its texts. However, tokenization will indeed be necessary. Right now, BaseX provides no tokenizer/analyzer for Chinese texts. It should indeed be possible to adopt code from Lucene, as we’ve
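One common Lucene approach to CJK text, which the message above suggests could be adopted, is to skip stemming entirely and index overlapping character bigrams (this is what Lucene's CJKBigramFilter does). A minimal self-contained sketch of that idea, with illustrative names that are not BaseX or Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of bigram segmentation for a run of CJK characters: since
// Chinese has no inflection and no word spaces, emit every overlapping
// two-character window as a token; a lone character becomes a unigram.
public class BigramDemo {
    static List<String> bigrams(String cjkRun) {
        List<String> out = new ArrayList<>();
        if (cjkRun.length() == 1) {
            out.add(cjkRun);
            return out;
        }
        for (int i = 0; i + 1 < cjkRun.length(); i++) {
            out.add(cjkRun.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中文分词"));  // [中文, 文分, 分词]
    }
}
```

Bigram indexing trades index size for recall: it needs no dictionary or statistical model, which is why it is a pragmatic first step before a full word segmenter.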

[basex-talk] stemming chinese texts

2020-10-12 Thread Philippe Pons
Dear BaseX Team, I'm currently working on Chinese texts in TEI. I would like to know whether stemming Chinese text is possible in BaseX, as it is for other languages (like English or German)? Or maybe there is a way to add this functionality with Lucene? Best regards, Philippe Pons -- Ingéni