The Wikipedia tagging should provide very good training sets. Has
anybody tried using them?
On 01/25/2013 02:14 AM, Jörn Kottmann wrote:
Hello,
well, the main problem with the models on SourceForge is that they
were trained on news data from the 90s, so they do not perform very
well on today's news articles or on out-of-domain data (anything
else).
When I speak with our users here and there, I always get the
impression that most people are still happy with the performance of
the Tokenizer, Sentence Splitter and POS Tagger, while many are
disappointed with the Name Finder models. That said, the name finder
works well if trained on your own data, for example along the lines
of the sketch below.
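A minimal sketch against the 1.5.x Java API (the exact
NameFinderME.train(...) overloads have changed between releases, and
the file names here are just placeholders):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.Collections;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainNameFinder {
      public static void main(String[] args) throws Exception {
        // Training data: one sentence per line, tokens separated by
        // spaces, names marked as <START:person> ... <END>.
        ObjectStream<String> lines = new PlainTextByLineStream(
            new FileInputStream("en-ner-person.train"), "UTF-8");
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        TokenNameFinderModel model;
        try {
          // 100 iterations and a feature cutoff of 5 are common defaults.
          model = NameFinderME.train("en", "person", samples,
              Collections.<String, Object>emptyMap(), 100, 5);
        } finally {
          samples.close();
        }

        model.serialize(new FileOutputStream("en-ner-person.bin"));
      }
    }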
Maybe the OntoNotes Corpus is something worth looking into.
The licensing is a gray area; you can probably get away with using
the models in commercial software. Corpus producers often restrict
the use of their corpora to research purposes only. The question is
whether they can enforce these restrictive terms on statistical
models built on the data, since a model probably doesn't violate the
copyright. Sorry for not having a better answer; you probably need
to ask a lawyer.
The evaluations in the documentation are often just samples to
illustrate how to use the tools. Have a look at the test plans in
our wiki, where we record the performance of OpenNLP for every
release we make.
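If you want to reproduce such numbers yourself, the evaluator
classes make that fairly straightforward. A rough sketch for the POS
tagger against the 1.5.x API, assuming a held-out file en-pos.test
in the usual word_TAG format (both file names are placeholders):

    import java.io.FileInputStream;

    import opennlp.tools.postag.POSEvaluator;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class EvaluatePosTagger {
      public static void main(String[] args) throws Exception {
        POSModel model = new POSModel(new FileInputStream("en-pos-maxent.bin"));

        // Held-out data, one sentence per line in word_TAG format.
        ObjectStream<POSSample> samples = new WordTagSampleStream(
            new PlainTextByLineStream(new FileInputStream("en-pos.test"), "UTF-8"));

        POSEvaluator evaluator = new POSEvaluator(new POSTaggerME(model));
        evaluator.evaluate(samples);
        samples.close();

        System.out.println("Accuracy: " + evaluator.getWordAccuracy());
      }
    }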
The models are mostly trained with default feature generation; have
a look at the documentation and our code to get more details about
it. The features are not yet well documented, but a documentation
patch to fix this would be very welcome!
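For reference, the default name finder feature generation
corresponds roughly to the following setup (this mirrors the snippet
in the feature generation section of the documentation; the classes
live in opennlp.tools.util.featuregen):

    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
    import opennlp.tools.util.featuregen.BigramNameFeatureGenerator;
    import opennlp.tools.util.featuregen.CachedFeatureGenerator;
    import opennlp.tools.util.featuregen.OutcomePriorFeatureGenerator;
    import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
    import opennlp.tools.util.featuregen.SentenceFeatureGenerator;
    import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
    import opennlp.tools.util.featuregen.TokenFeatureGenerator;
    import opennlp.tools.util.featuregen.WindowFeatureGenerator;

    // Roughly the name finder defaults: tokens and token classes in a
    // 2-token window, outcome priors, previously assigned outcomes,
    // token bigrams and sentence position.
    AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
        new AdaptiveFeatureGenerator[] {
          new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
          new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
          new OutcomePriorFeatureGenerator(),
          new PreviousMapFeatureGenerator(),
          new BigramNameFeatureGenerator(),
          new SentenceFeatureGenerator(true, false)
        });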
HTH,
Jörn
On 01/25/2013 10:36 AM, Christian Moen wrote:
Hello,
I'm exploring the possibility of using OpenNLP in commercial
software. As part of this, I'd like to assess the quality of some of
the models available on http://opennlp.sourceforge.net/models-1.5/
and also learn more about the applicable license terms.
My primary interest for now is in the English models for the
Tokenizer, Sentence Detector and POS Tagger.
The documentation on
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html
provides scores for various models as part of evaluation run
examples. Do these scores generally reflect those of the models on
the SourceForge download page? Are further details on model quality,
source corpora, features used, etc. available?
I've seen posts to this list explaining, as a general comment, that
"the models are subject to the licensing restrictions of the
copyright holders of the corpus used to train them." I understand
that the models on SourceForge aren't part of any Apache OpenNLP
release, but I'd very much appreciate it if someone in the know
could provide further insight into the applicable licensing terms.
I'd be glad to be wrong about this, but my understanding is that the
models can't be used commercially.
Many thanks for any insight.
Christian