Hi all,

In case some of you are interested, I've implemented a UIMA component for word tokenization. It handles French text better than the WhitespaceTokenizer does.
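To give a rough idea of why whitespace splitting falls short for French: elided articles such as "l'" or "d'" stay glued to the following word, and punctuation sticks to tokens. The sketch below is purely illustrative (it is not the implementation from the repository, and the regex and class name are my own assumptions); it separates elisions and punctuation while keeping hyphenated forms like "peut-être" as single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only, not the component announced here: a simple
// French-aware word tokenizer. Unlike pure whitespace splitting, it
// detaches elided articles ("l'", "d'", "qu'", ...) from the following
// word and splits off punctuation, while keeping hyphenated words whole.
public class FrenchTokenizerSketch {
    // A token is: letters ending in an apostrophe (elision),
    // a possibly hyphenated word, or a single punctuation mark.
    private static final Pattern TOKEN = Pattern.compile(
            "\\p{L}+'|\\p{L}+(?:-\\p{L}+)*|\\p{Punct}");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace splitting would yield "L'homme" and "mange," as tokens;
        // here elision and punctuation are separated out.
        System.out.println(tokenize("L'homme mange, peut-être."));
    }
}
```

A real UIMA component would of course wrap this logic in an annotator's `process()` method and emit `Token` annotations over the CAS rather than returning strings; this fragment only shows the segmentation idea.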
The details of the implementation are described on my blog [1] (in French only, sorry), and I've opened a GitHub repository [2] for those who would like to contribute or just use it.

[1] http://www.fabienpoulard.info/dotclear.php?post/2010/09/06/Un-rapide-tokeniseur-en-mots-pour-le-fran%C3%A7ais
[2] http://github.com/grdscarabe/uima-word-tokenizer

--
Fabien Poulard
LINA (UMR CNRS 6241) / Université de Nantes