Lance, could you say more? Do you mean WP tagging as training data for the NER task?
Thanks,
jds

On Sun, Jan 27, 2013 at 11:07 PM, Lance Norskog <[email protected]> wrote:
> The Wikipedia tagging should provide very good training sets. Has anybody
> tried using them?
>
> On 01/25/2013 02:14 AM, Jörn Kottmann wrote:
>> Hello,
>>
>> Well, the main problem with the models on SourceForge is that they were
>> trained on news data from the 90s and do not perform very well on today's
>> news articles or on out-of-domain data (anything else).
>>
>> When I speak to our users here and there, I always get the impression
>> that most people are still happy with the performance of the Tokenizer,
>> Sentence Splitter and POS Tagger. Many are disappointed with the Name
>> Finder models; that said, the name finder works well if trained on your
>> own data.
>>
>> Maybe the OntoNotes Corpus is something worth looking into.
>>
>> The licensing is a gray area; you can probably get away with using the
>> models in commercial software. The corpus producers often restrict the
>> usage of their corpus to research purposes only. The question is whether
>> they can enforce these restrictive terms on statistical models built on
>> the data, since the models probably don't violate the copyright. Sorry
>> for not having a better answer; you probably need to ask a lawyer.
>>
>> The evaluations in the documentation are often just samples to illustrate
>> how to use the tools. Have a look at the test plans in our wiki; we
>> record the performance of OpenNLP there for every release we make.
>>
>> The models are mostly trained with default feature generation; have a
>> look at the documentation and our code for more details. The features are
>> not yet well documented, but a documentation patch to fix this would be
>> very welcome!
>>
>> HTH,
>> Jörn
>>
>> On 01/25/2013 10:36 AM, Christian Moen wrote:
>>
>>> Hello,
>>>
>>> I'm exploring the possibility of using OpenNLP in commercial software.
>>> As part of this, I'd like to assess the quality of some of the models
>>> available on http://opennlp.sourceforge.net/models-1.5/ and also learn
>>> more about the applicable license terms.
>>>
>>> My primary interest for now is the English models for the Tokenizer,
>>> Sentence Detector and POS Tagger.
>>>
>>> The documentation on
>>> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html
>>> provides scores for various models as part of evaluation run examples.
>>> Do these scores generally reflect those of the models on the SourceForge
>>> download page? Are further details on model quality, source corpora,
>>> features used, etc. available?
>>>
>>> I've seen posts to this list explain, as a general comment, that "the
>>> models are subject to the licensing restrictions of the copyright
>>> holders of the corpus used to train them." I understand that the models
>>> on SourceForge aren't part of any Apache OpenNLP release, but I'd very
>>> much appreciate it if someone in the know could provide further insight
>>> into the applicable licensing terms. I'd be glad to be wrong about this,
>>> but my understanding is that the models can't be used commercially.
>>>
>>> Many thanks for any insight.
>>>
>>>
>>> Christian
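
[Editor's note: Jörn's point that the name finder works well when trained on your own data can be made concrete. In the training format documented in the OpenNLP manual, names are marked inline with `<START:type>` ... `<END>` tags, one tokenized sentence per line. The snippet below is a minimal sketch of such a training file; the sentences are the manual's own example, and the `person` type label is just one possible entity type.]

    <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
    Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

[A model can then be trained from the command line with the bundled trainer, e.g. `opennlp TokenNameFinderTrainer -lang en -encoding UTF-8 -data en-ner-person.train -model en-ner-person.bin` (1.5-era option syntax; the file names here are illustrative, and the manual for your release is authoritative for the exact flags).]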
