Hello, I suggest not using the old models anymore; the name finders in particular don't perform well on recent news articles.
I am not aware of which data was used to train the sentence detector, tokenizer, and POS tagger. The latter, I guess, could be based on Brown and Penn Treebank data. There is support for training with OntoNotes. I think most components can now be trained on that data, but I only did so for the name finders, which turned out to work quite well. OntoNotes can be acquired very cheaply.

Jörn

On Fri, 2015-01-30 at 15:12 +0000, [email protected] wrote:
> I am using OpenNLP in my research to extract terms from an educational
> corpus, and I would like to ask about the OpenNLP models (chunker,
> sentence detector, tokenizer, maxent POS tagger). What training data
> set was used? It is clearly stated that CoNLL 2000 was used to train
> the chunker; however, no information is provided about the training
> data used for the sentence detector, tokenizer, and maxent POS tagger.
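P.S. In case it helps: retraining a name finder from annotated data is a sketch along these lines with the command-line tools. The file names (train.txt, test.txt, en-ner-person.bin) are placeholders, and the exact options can differ between OpenNLP versions, so check the manual for the release you use:

```shell
# Training data in OpenNLP's native name-finder format, one sentence per
# line, tokens whitespace-separated, entity spans marked inline, e.g.:
#   <START:person> Pierre Vinken <END> , 61 years old , will join the board .

# Train a person-name model from that file:
opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en \
    -data train.txt -encoding UTF-8

# Evaluate it against held-out data annotated in the same format:
opennlp TokenNameFinderEvaluator -model en-ner-person.bin \
    -data test.txt -encoding UTF-8
```

For corpora in other formats (such as OntoNotes or the CoNLL shared-task data), the trainer also accepts a format argument so you don't have to convert the data by hand first.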
