Lance, could you say more? Do you mean WP tagging as training data for the NER task?
Thanks,
jds

On Sun, Jan 27, 2013 at 11:07 PM, Lance Norskog <[email protected]> wrote:
> The Wikipedia tagging should provide very good training sets. Has anybody
> tried using them?
>
> On 01/25/2013 02:14 AM, Jörn Kottmann wrote:
>> Hello,
>>
>> Well, the main problem with the models on SourceForge is that they were
>> trained on news data from the 90s and do not perform very well on today's
>> news articles or on out-of-domain data (anything else).
>>
>> When I speak to our users here and there, I always get the impression
>> that most people are still happy with the performance of the Tokenizer,
>> Sentence Splitter and POS Tagger. Many are disappointed with the Name
>> Finder models; that said, the name finder works well if trained on your
>> own data.
>>
>> Maybe the OntoNotes Corpus is something worth looking into.
>>
>> The licensing is a gray area; you can probably get away with using the
>> models in commercial software. The corpus producers often restrict the
>> usage of their corpus to research purposes only. The question is whether
>> they can enforce these restrictive terms on statistical models built on
>> the data, since the models probably don't violate the copyright. Sorry
>> for not having a better answer; you probably need to ask a lawyer.
>>
>> The evaluations in the documentation are often just samples to illustrate
>> how to use the tools. Have a look at the test plans in our wiki; we
>> record the performance of OpenNLP there for every release we make.
>>
>> The models are mostly trained with default feature generation; have a
>> look at the documentation and our code for more details. The features are
>> not yet well documented, but a documentation patch to fix this would be
>> very welcome!
>>
>> HTH,
>> Jörn
>>
>> On 01/25/2013 10:36 AM, Christian Moen wrote:
>>
>>> Hello,
>>>
>>> I'm exploring the possibility of using OpenNLP in commercial software.
>>> As part of this, I'd like to assess the quality of some of the models
>>> available on http://opennlp.sourceforge.net/models-1.5/ and also learn
>>> more about the applicable license terms.
>>>
>>> My primary interest for now is the English models for the Tokenizer,
>>> Sentence Detector and POS Tagger.
>>>
>>> The documentation on
>>> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html
>>> provides scores for various models as part of evaluation run examples.
>>> Do these scores generally reflect those of the models on the SourceForge
>>> download page? Are further details on model quality, source corpora,
>>> features used, etc. available?
>>>
>>> I've seen posts to this list explain, as a general comment, that "the
>>> models are subject to the licensing restrictions of the copyright
>>> holders of the corpus used to train them." I understand that the models
>>> on SourceForge aren't part of any Apache OpenNLP release, but I'd very
>>> much appreciate it if someone in the know could provide further insight
>>> into the applicable licensing terms. I'd be glad to be wrong about this,
>>> but my understanding is that the models can't be used commercially.
>>>
>>> Many thanks for any insight.
>>>
>>>
>>> Christian
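
[Editor's note: Jörn's point that the name finder works well when trained on your own data can be made concrete. In the training format documented in the OpenNLP manual, names are marked inline with `<START:type>` ... `<END>` tags, one tokenized sentence per line. The snippet below is a minimal sketch of such a training file; the sentences are the manual's own example, and the `person` type label is just one possible entity type.]

    <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
    Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

[A model can then be trained from the command line with the bundled trainer, e.g. `opennlp TokenNameFinderTrainer -lang en -encoding UTF-8 -data en-ner-person.train -model en-ner-person.bin` (1.5-era option syntax; the file names here are illustrative, and the manual for your release is authoritative for the exact flags).]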
