The Wikipedia tagging should provide very good training sets. Has anybody tried using them?

On 01/25/2013 02:14 AM, Jörn Kottmann wrote:
Hello,

well, the main problem with the models on SourceForge is that they were trained on news data from the 90s, so they do not perform very well on today's news articles or on out-of-domain data (anything else).

When I speak to our users here and there, I always get the impression that most people are still happy with the performance of the Tokenizer, Sentence Splitter, and POS Tagger. Many are disappointed with the Name Finder models, but the Name Finder works well when trained on your own data.
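For reference, Name Finder training data is plain text with one whitespace-tokenized sentence per line and names marked with span tags. The sentences below are made up for illustration:

```
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
```

A model can then be trained on such a file with the command-line tool, e.g. `opennlp TokenNameFinderTrainer -lang en -encoding UTF-8 -data en-ner-person.train -model en-ner-person.bin` (check the manual of your version for the exact options).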

Maybe the OntoNotes Corpus is something worth looking into.

The licensing is a gray area; you can probably get away with using the models in commercial software. Corpus producers often restrict the use of their corpora to research purposes only. The question is whether they can also enforce these restrictive terms on statistical models built on the data, since a model probably doesn't violate the copyright. Sorry for not having a better answer; you probably need to ask a lawyer.

The evaluations in the documentation are often just samples to illustrate how to use the tools. Have a look at the test plans in our wiki; we record the performance of OpenNLP there for every release we make.

The models are mostly trained with default feature generation; have a look at the documentation and our code for more details. The features are not yet well documented, but a documentation patch to fix this
would be very welcome!
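For the Name Finder, the default feature set (tokens and token classes in a two-token window, an outcome prior, a previous-outcome map, bigrams, and a sentence-start feature) can also be expressed as a feature generator XML descriptor. The element names below are taken from the featuregen documentation, so treat this as a sketch rather than the exact shipped default:

```xml
<generators>
  <cache>
    <generators>
      <window prevLength="2" nextLength="2">
        <token/>
      </window>
      <window prevLength="2" nextLength="2">
        <tokenclass/>
      </window>
      <definition/>
      <prevmap/>
      <bigram/>
      <sentence begin="true" end="false"/>
    </generators>
  </cache>
</generators>
```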

HTH,
Jörn

On 01/25/2013 10:36 AM, Christian Moen wrote:
Hello,

I'm exploring the possibility of using OpenNLP in commercial software. As part of this, I'd like to assess the quality of some of the models available on http://opennlp.sourceforge.net/models-1.5/ and also learn more about the applicable license terms.

My primary interest for now are the English models for Tokenizer, Sentence Detector and POS Tagger.

The documentation on http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html provides scores for various models as part of evaluation run examples. Do these scores generally reflect those of the models on the SourceForge download page? Are further details on model quality, source corpora, features used, etc. available?

I've seen posts to this list explain, as a general comment, that "the models are subject to the licensing restrictions of the copyright holders of the corpus used to train them." I understand that the models on SourceForge aren't part of any Apache OpenNLP release, but I'd very much appreciate it if someone in the know could provide further insight into the applicable licensing terms. I'd be glad to be wrong about this, but my understanding is that the models can't be used commercially.

Many thanks for any insight.


Christian



