Hi all,

In case some of you are interested, I've implemented a UIMA component
for word tokenization. It handles French text better than the standard
WhitespaceTokenizer does.
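
For readers who haven't written UIMA annotators before, here is a
minimal sketch of this kind of component. The class name, the regex,
and the use of the generic Annotation type are my own illustrative
assumptions, not the actual code from the repository; the point is the
French-specific handling of elided clitics (l', d', qu') and
hyphenated words, which a whitespace split gets wrong.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class FrenchWordTokenizer extends JCasAnnotator_ImplBase {

  // Illustrative French-aware token pattern: keeps elided clitics
  // (l', qu') as separate tokens and hyphenated words whole.
  private static final Pattern TOKEN = Pattern.compile(
        "\\p{L}+['’]"              // elided clitic, e.g. l', qu'
      + "|\\p{L}+(?:-\\p{L}+)*"    // plain or hyphenated word
      + "|\\d+(?:[.,]\\d+)?"       // number, e.g. 3,14
      + "|\\S");                   // any other symbol on its own

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    Matcher m = TOKEN.matcher(jcas.getDocumentText());
    while (m.find()) {
      // A real component would define its own Token type via JCasGen;
      // the built-in Annotation keeps this sketch self-contained.
      new Annotation(jcas, m.start(), m.end()).addToIndexes();
    }
  }
}

With such a pattern, "l'école" yields the two tokens "l'" and "école",
whereas a whitespace tokenizer keeps them glued together.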

The details of the implementation are described on my blog [1] (in
French only, sorry), and I've opened a GitHub repository [2] for those
who would like to contribute or just use it.

[1] http://www.fabienpoulard.info/dotclear.php?post/2010/09/06/Un-rapide-tokeniseur-en-mots-pour-le-fran%C3%A7ais
[2] http://github.com/grdscarabe/uima-word-tokenizer

--
Fabien Poulard
LINA (UMR CNRS 6241) / Université de Nantes
