The Wikipedia tagging should provide very good training sets. Has
anybody tried using them?
On 01/25/2013 02:14 AM, Jörn Kottmann wrote:
Hello,
well, the main problem with the models on SourceForge is that they
were trained on news data from the 90s, so they do not perform very
well on today's news articles or on out-of-domain data (anything
else).
When I speak with our users here and there, I always get the
impression that most people are still happy with the performance of
the Tokenizer, Sentence Splitter and POS Tagger, while many are
disappointed with the Name Finder models. That said, the name finder
works well if trained on your own data, for example along the lines
of the sketch below.
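A minimal sketch against the 1.5.x Java API (the exact
NameFinderME.train(...) overloads have changed between releases, and
the file names here are just placeholders):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.Collections;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainNameFinder {
      public static void main(String[] args) throws Exception {
        // Training data: one sentence per line, tokens separated by
        // spaces, names marked as <START:person> ... <END>.
        ObjectStream<String> lines = new PlainTextByLineStream(
            new FileInputStream("en-ner-person.train"), "UTF-8");
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        TokenNameFinderModel model;
        try {
          // 100 iterations and a feature cutoff of 5 are common defaults.
          model = NameFinderME.train("en", "person", samples,
              Collections.<String, Object>emptyMap(), 100, 5);
        } finally {
          samples.close();
        }

        model.serialize(new FileOutputStream("en-ner-person.bin"));
      }
    }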
Maybe the OntoNotes Corpus is something worth looking into.
The licensing is a gray area; you can probably get away with using
the models in commercial software. Corpus producers often restrict
the use of their corpora to research purposes only. The question is
whether they can enforce these restrictive terms on statistical
models built on the data, since a model probably doesn't violate the
copyright. Sorry for not having a better answer; you probably need
to ask a lawyer.
The evaluations in the documentation are often just samples to
illustrate how to use the tools. Have a look at the test plans in
our wiki, where we record the performance of OpenNLP for every
release we make.
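If you want to reproduce such numbers yourself, the evaluator
classes make that fairly straightforward. A rough sketch for the POS
tagger against the 1.5.x API, assuming a held-out file en-pos.test
in the usual word_TAG format (both file names are placeholders):

    import java.io.FileInputStream;

    import opennlp.tools.postag.POSEvaluator;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class EvaluatePosTagger {
      public static void main(String[] args) throws Exception {
        POSModel model = new POSModel(new FileInputStream("en-pos-maxent.bin"));

        // Held-out data, one sentence per line in word_TAG format.
        ObjectStream<POSSample> samples = new WordTagSampleStream(
            new PlainTextByLineStream(new FileInputStream("en-pos.test"), "UTF-8"));

        POSEvaluator evaluator = new POSEvaluator(new POSTaggerME(model));
        evaluator.evaluate(samples);
        samples.close();

        System.out.println("Accuracy: " + evaluator.getWordAccuracy());
      }
    }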
The models are mostly trained with default feature generation; have
a look at the documentation and our code to get more details about
it. The features are not yet well documented, but a documentation
patch to fix this would be very welcome!
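For reference, the default name finder feature generation
corresponds roughly to the following setup (this mirrors the snippet
in the feature generation section of the documentation; the classes
live in opennlp.tools.util.featuregen):

    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
    import opennlp.tools.util.featuregen.BigramNameFeatureGenerator;
    import opennlp.tools.util.featuregen.CachedFeatureGenerator;
    import opennlp.tools.util.featuregen.OutcomePriorFeatureGenerator;
    import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
    import opennlp.tools.util.featuregen.SentenceFeatureGenerator;
    import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
    import opennlp.tools.util.featuregen.TokenFeatureGenerator;
    import opennlp.tools.util.featuregen.WindowFeatureGenerator;

    // Roughly the name finder defaults: tokens and token classes in a
    // 2-token window, outcome priors, previously assigned outcomes,
    // token bigrams and sentence position.
    AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
        new AdaptiveFeatureGenerator[] {
          new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
          new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
          new OutcomePriorFeatureGenerator(),
          new PreviousMapFeatureGenerator(),
          new BigramNameFeatureGenerator(),
          new SentenceFeatureGenerator(true, false)
        });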
HTH,
Jörn
On 01/25/2013 10:36 AM, Christian Moen wrote:
Hello,
I'm exploring the possibility of using OpenNLP in commercial
software. As part of this, I'd like to assess the quality of some of
the models available on http://opennlp.sourceforge.net/models-1.5/
and also learn more about the applicable license terms.
My primary interest for now is in the English models for the
Tokenizer, Sentence Detector and POS Tagger.
The documentation on
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html
provides scores for various models as part of evaluation run
examples. Do these scores generally reflect those of the models on
the SourceForge download page? Are further details on model quality,
source corpora, features used, etc. available?
I've seen posts to this list explaining, as a general comment, that
"the models are subject to the licensing restrictions of the
copyright holders of the corpus used to train them." I understand
that the models on SourceForge aren't part of any Apache OpenNLP
release, but I'd very much appreciate it if someone in the know
could provide further insight into the applicable licensing terms.
I'd be glad to be wrong about this, but my understanding is that the
models can't be used commercially.
Many thanks for any insight.
Christian