Yes. The Wikipedia XML has person/place/etc. tags throughout the article text.
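As an illustration of the idea, here is a minimal Python sketch (stdlib only) of converting inline entity tags into OpenNLP's name-finder training markup. The `<person>`/`<place>` tag names here are a hypothetical input format, not the actual Wikipedia dump schema; the `<START:type> ... <END>` output format is the one OpenNLP's name finder trains on.

```python
import re

def to_opennlp(text):
    """Convert hypothetical inline tags like <person>John Smith</person>
    into OpenNLP name-finder training markup: <START:person> John Smith <END>."""
    return re.sub(
        r"<(person|place)>(.*?)</\1>",
        lambda m: "<START:{}> {} <END>".format(m.group(1), m.group(2)),
        text,
    )

line = "<person>Lance Norskog</person> posted from <place>California</place> ."
print(to_opennlp(line))
# -> <START:person> Lance Norskog <END> posted from <START:place> California <END> .
```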

On 01/27/2013 08:15 PM, John Stewart wrote:
Lance, could you say more?  Do you mean WP tagging as training data for the
NER task?

Thanks,

jds


On Sun, Jan 27, 2013 at 11:07 PM, Lance Norskog <[email protected]> wrote:

The Wikipedia tagging should provide very good training sets. Has anybody
tried using them?


On 01/25/2013 02:14 AM, Jörn Kottmann wrote:

Hello,

well, the main problem with the models on SourceForge is that they were
trained on news data from the 90s and do not perform very well on today's
news articles or on out-of-domain data (anything else).

When I speak with our users here and there, I always get the impression
that most people are still happy with the performance of the Tokenizer,
Sentence Splitter and POS Tagger, while many are disappointed with the
Name Finder models. That said, the name finder works well if trained on
your own data.
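To make the "train on your own data" point concrete: OpenNLP's name finder expects one sentence per line, pre-tokenized, with entities wrapped in `<START:type> ... <END>`. Below is a small stdlib-only Python sketch that writes a sample training file and sanity-checks that the markup is balanced; the sample sentences and file name are invented for illustration.

```python
# Sample lines in OpenNLP's name-finder training format:
# one sentence per line, tokens separated by spaces,
# entities wrapped in <START:type> ... <END>.
samples = [
    "<START:person> Pierre Vinken <END> , 61 years old , will join the board .",
    "The meeting was held in <START:location> Brussels <END> last week .",
]

def balanced(line):
    """Sanity check: every <START:...> tag has a matching <END>."""
    return line.count("<START:") == line.count("<END>")

assert all(balanced(s) for s in samples)

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(samples) + "\n")

# A model would then be trained with the command-line tool, e.g.:
#   opennlp TokenNameFinderTrainer -lang en -data train.txt \
#           -model en-ner-custom.bin -encoding UTF-8
```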

Maybe the OntoNotes Corpus is something worth looking into.

The licensing is a gray area; you can probably get away with using the
models in commercial software. The corpus producers often restrict the
usage of their corpus to research purposes only. The question is whether
they can enforce these restrictive terms on statistical models built on
the data, since the models probably don't violate the copyright. Sorry
for not having a better answer; you probably need to ask a lawyer.

The evaluations in the documentation are often just samples to illustrate
how to use the tools.
Have a look at the test plans in our wiki; we record the performance
of OpenNLP there for every release we make.

The models are mostly trained with default feature generation; have a
look at the documentation and our code to get more details about it.
The features are not yet well documented, but a documentation patch to
fix this would be very welcome!

HTH,
Jörn

On 01/25/2013 10:36 AM, Christian Moen wrote:

Hello,

I'm exploring the possibility of using OpenNLP in commercial software.
As part of this, I'd like to assess the quality of some of the models
available on http://opennlp.sourceforge.net/models-1.5/ and also learn
more about the applicable license terms.

My primary interests for now are the English models for the Tokenizer,
Sentence Detector and POS Tagger.

The documentation on
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html
provides scores for various models as part of evaluation run examples.  Do
these scores generally reflect those of the models on the SourceForge
download page?  Are further details on model quality, source corpora,
features used, etc. available?

I've seen posts to this list explaining, as a general comment, that "the
models are subject to the licensing restrictions of the copyright holders
of the corpus used to train them."  I understand that the models on
SourceForge aren't part of any Apache OpenNLP release, but I'd very much
appreciate it if someone in the know could provide further insight into
the applicable licensing terms.  I'd be glad to be wrong about this, but
my understanding is that the models can't be used commercially.

Many thanks for any insight.


Christian



