On 04/10/2012 04:44 PM, Joan Codina wrote:
> But to train the system I only found this one file, which is small:
> http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
> It only contains 121 sentences. I don't know if this is enough or
> whether there is other annotated training data.

No, that is not enough. Get a training data set for the language you
need. Most of the data sets referenced in the Corpora section can be
used to train the tokenizer. These corpora are already tokenized and
can be de-tokenized into training data for the tokenizer.
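
To illustrate the de-tokenization step, here is a rough sketch in Python.
It assumes the tokenizer training format with one sentence per line, where
a token boundary that has no whitespace is marked with <SPLIT>. The rule
sets ATTACH_LEFT and ATTACH_RIGHT and the function name are hypothetical
and only cover a few English punctuation cases; this is not the
detokenizer that ships with OpenNLP, and a real corpus will need
language-specific rules.

```python
# Hypothetical rule sets (assumptions for illustration only).
# Tokens that normally attach to the token on their left:
ATTACH_LEFT = {".", ",", ";", ":", "!", "?", ")", "]", "}"}
# Tokens that normally attach to the token on their right:
ATTACH_RIGHT = {"(", "[", "{"}


def to_tokenizer_training_line(tokens):
    """Join pre-tokenized text into one training line.

    Boundaries that would carry no whitespace in running text are
    marked with <SPLIT>; all other boundaries become a single space.
    """
    parts = []
    for i, tok in enumerate(tokens):
        if i > 0:
            prev = tokens[i - 1]
            if tok in ATTACH_LEFT or prev in ATTACH_RIGHT:
                parts.append("<SPLIT>")
            else:
                parts.append(" ")
        parts.append(tok)
    return "".join(parts)


if __name__ == "__main__":
    sentence = ["Pierre", "Vinken", ",", "61", "years", "old", ",",
                "will", "join", "the", "board", "(", "soon", ")", "."]
    print(to_tokenizer_training_line(sentence))
```

Running this over every sentence of a tokenized corpus and writing one
line per sentence gives a file that can be used in place of the small
token.train sample.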
Jörn