Re: [basex-talk] stemming chinese texts

2020-10-19 Thread Duncan Paterson
Dear All, 

Each of the Lucene analyzers has its use case, and I have used them all. 
Maybe we can arrange a short video call to discuss what you want to achieve, 
and where the pitfalls might be. 

Generally, I would advise not to think about the Japanese analyzer all that 
much; Chinese is very different, so many of its features simply don’t apply. 

Greetings 
Duncan

P.S.: I’m located in Germany and am generally available for a call on Wednesdays. 



Re: [basex-talk] stemming chinese texts

2020-10-18 Thread Christian Grün
Hi Duncan,

Thanks for offering your help, that’s appreciated.

We could add Lucene’s CJK analyzers to BaseX, and either embed them or
provide them as a library, similar to the Japanese tokenizer. Have you
already used the Lucene analyzers [1], and if so, which of the 3
provided analyzers would you recommend?

Or have you perhaps already implemented full-text search with BaseX and Chinese texts?

Cheers,
Christian

[1] 
https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html




[basex-talk] stemming chinese texts

2020-10-15 Thread Duncan Paterson
Dear Christian, 


I’d be happy to chime in on the quality of BaseX’s Chinese-language full-text 
capabilities. Chinese sources are my primary research area. What exactly do you 
have in mind?

Greetings
Duncan

Ceterum censeo exist-db.org esse conriganda (“Furthermore, I consider that exist-db.org must be fixed”)






Re: [basex-talk] stemming chinese texts

2020-10-14 Thread Philippe Pons

Hi Christian,

I suppose some of my colleagues would be able to judge the quality of 
your full-text search results.


On the other hand, at the code level, I'm not sure I know how to implement 
an additional class that extends the abstract Tokenizer class.


Thank you for your help
Philippe



Re: [basex-talk] stemming chinese texts

2020-10-14 Thread Christian Grün
Hi Philippe,

Thanks for your mail in private, in which I already gave you a little
assessment on what might be necessary to include the CJK tokenizers in
BaseX:

The existing Apache code can be adapted and embedded into the BaseX
tokenizer infrastructure. At the code level, an additional class needs to
be implemented that extends the abstract Tokenizer class [1].
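
To give a rough idea, a minimal, hypothetical sketch of such a bridge is
shown below. The iterator-style contract is an assumption for illustration
only; the actual abstract methods of the Tokenizer class differ (see [1]),
and CJKAnalyzer merely stands in for whichever Lucene analyzer gets picked:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.cjk.CJKAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  // Hypothetical adapter: runs a Lucene analyzer over an input string and
  // exposes the resulting tokens one by one, i.e. the shape of logic a
  // real Tokenizer subclass would need internally.
  public final class ChineseTokensSketch implements Iterator<String> {
    private final Iterator<String> tokens;

    public ChineseTokensSketch(final String text) throws IOException {
      final List<String> list = new ArrayList<>();
      final Analyzer analyzer = new CJKAnalyzer();
      // tokenStream() is the standard Lucene entry point; the field name
      // is irrelevant here.
      try (TokenStream ts = analyzer.tokenStream("", text)) {
        final CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        // Each incrementToken() call advances to the next token.
        while (ts.incrementToken()) list.add(term.toString());
        ts.end();
      }
      tokens = list.iterator();
    }

    @Override public boolean hasNext() { return tokens.hasNext(); }
    @Override public String next() { return tokens.next(); }
  }

The real adaptation work would then lie in mapping such tokens onto BaseX’s
full-text options (case, diacritics, token positions) rather than in the
Lucene call itself.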

As far as I can judge, the 3 Lucene CJK analyzers could all be applied
to traditional and simplified Chinese. If we found someone who could
rate the linguistic quality of our full-text search results, that’d
surely be helpful.

Hope this helps,
Christian

[1] 
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/ft/Tokenizer.java





Re: [basex-talk] stemming chinese texts

2020-10-13 Thread Philippe Pons

Dear Christian,

Thank you very much for this quick and enlightening response.

Without having had the opportunity to test it yet, I have indeed read about 
the Japanese text tokenizer.

Supporting Chinese tokenization would also be a great help.

I have never tested what Lucene offers, especially since I have to 
manage texts in traditional Chinese and simplified Chinese (without 
reading either one myself).
I would like to test Lucene's analyzers, but I don't know how to do that 
in BaseX.


Best regards,
Philippe Pons







Re: [basex-talk] stemming chinese texts

2020-10-12 Thread Christian Grün
Dear Philippe,

As the Chinese language rarely uses inflection, there is usually no
need to perform stemming on texts. However, tokenization will be
necessary indeed. Right now, BaseX provides no tokenizer/analyzer for
Chinese texts. It should be possible indeed to adopt code from Lucene,
as we’ve already done for other languages (our software licenses allow
that).

Have you already worked with tokenization of Chinese texts in Lucene?
If yes, which of the 3 available analyzers [1] have proven to yield
the best results?

As you may know, one of our users, Toshio HIRAI, has contributed a
tokenizer for Japanese texts in the past [2]. If we decide to include
support for Chinese tokenization, it might also be interesting to
compare the results of the Apache tokenizer with our internal
tokenizer.
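
For a first impression outside of BaseX, the analyzers can also be run
standalone. A minimal sketch, assuming the lucene-analyzers-common and
lucene-analyzers-smartcn artifacts are on the classpath; CJKAnalyzer and
SmartChineseAnalyzer serve here as two illustrative candidates, and the
input string is an arbitrary sample:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.cjk.CJKAnalyzer;
  import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public final class CompareCjkAnalyzers {
    // Prints all tokens an analyzer produces for the given text.
    static void dump(final Analyzer analyzer, final String text)
        throws Exception {
      try (TokenStream ts = analyzer.tokenStream("", text)) {
        final CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) System.out.print(term.toString() + " ");
        ts.end();
        System.out.println();
      }
    }

    public static void main(final String[] args) throws Exception {
      final String text = "我爱北京天安门"; // arbitrary simplified-Chinese sample
      // CJKAnalyzer indexes overlapping character bigrams:
      // 我爱 爱北 北京 京天 天安 安门
      dump(new CJKAnalyzer(), text);
      // SmartChineseAnalyzer performs dictionary-based word segmentation.
      dump(new SmartChineseAnalyzer(), text);
    }
  }

Comparing the two outputs on a few sentences from the actual corpus would
already give something concrete for a native reader to rate.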

Kind regards,
Christian

[1] 
https://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/cjk/package-summary.html
[2] https://docs.basex.org/wiki/Full-Text:_Japanese





[basex-talk] stemming chinese texts

2020-10-12 Thread Philippe Pons

Dear BaseX Team,

I'm currently working on Chinese texts in TEI.
I would like to know whether stemming Chinese text is possible in BaseX, as 
we can do with other languages (like English or German).

Or maybe there is a way to add this functionality with Lucene?

Best regards,
Philippe Pons

--
Research engineer in charge of editing digital corpora
Centre de recherche sur les civilisations de l'Asie Orientale
CRCAO - UMR 8155 (Collège de France, EPHE, CNRS, PSL Research University, Univ 
Paris Diderot, Sorbonne Paris Cité)
49bis avenue de la Belle Gabrielle
75012 Paris
https://cv.archives-ouvertes.fr/ppons