Hi all,

In case some of you are interested, I've implemented a UIMA component for word tokenization. It handles French text better than the WhitespaceTokenizer does.
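To give a rough idea of why whitespace splitting falls short for French: elided articles such as "l'" or "d'" stay glued to the following word, and punctuation sticks to tokens. The sketch below is purely illustrative (it is not the implementation from the repository, and the regex and class name are my own assumptions); it separates elisions and punctuation while keeping hyphenated forms like "peut-être" as single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only, not the component announced here: a simple
// French-aware word tokenizer. Unlike pure whitespace splitting, it
// detaches elided articles ("l'", "d'", "qu'", ...) from the following
// word and splits off punctuation, while keeping hyphenated words whole.
public class FrenchTokenizerSketch {
    // A token is: letters ending in an apostrophe (elision),
    // a possibly hyphenated word, or a single punctuation mark.
    private static final Pattern TOKEN = Pattern.compile(
            "\\p{L}+'|\\p{L}+(?:-\\p{L}+)*|\\p{Punct}");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace splitting would yield "L'homme" and "mange," as tokens;
        // here elision and punctuation are separated out.
        System.out.println(tokenize("L'homme mange, peut-être."));
    }
}
```

A real UIMA component would of course wrap this logic in an annotator's `process()` method and emit `Token` annotations over the CAS rather than returning strings; this fragment only shows the segmentation idea.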
The details of the implementation are described on my blog [1] (in French only, sorry), and I've opened a GitHub repository [2] for those who would like to contribute or just use it.

[1] http://www.fabienpoulard.info/dotclear.php?post/2010/09/06/Un-rapide-tokeniseur-en-mots-pour-le-fran%C3%A7ais
[2] http://github.com/grdscarabe/uima-word-tokenizer

--
Fabien Poulard
LINA (UMR CNRS 6241) / Université de Nantes