Hello,

We've done some experiments trying to synthesise an NER corpus from Wikipedia
using various heuristics and link-structure analyses.  However, the models we
trained on it didn't perform very well when scored against a gold standard
tagged by humans.  I'm sure there are many improvements we could consider, but
we didn't find pursuing this further very promising.  Basically, there were
too many issues to address to produce a corpus of good quality.  I believe
academic research in the field has run into similar challenges.  It was a fun
little study, though.
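
In case anyone wants to try something similar, scoring a model against a
hand-tagged gold standard can be done with OpenNLP's own evaluator.  A rough
sketch, assuming the 1.5.x API (the file names here are just placeholders):

    import java.io.FileInputStream;
    import java.nio.charset.Charset;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderEvaluator;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class GoldEval {
        public static void main(String[] args) throws Exception {
            // Model trained on the synthesised Wikipedia corpus (placeholder name)
            TokenNameFinderModel model = new TokenNameFinderModel(
                new FileInputStream("wikipedia-person.bin"));

            // Hand-tagged gold standard, one sentence per line, entities marked
            // with <START:person> ... <END> (placeholder name)
            ObjectStream<NameSample> gold = new NameSampleDataStream(
                new PlainTextByLineStream(
                    new FileInputStream("gold-standard.txt"),
                    Charset.forName("UTF-8")));

            TokenNameFinderEvaluator evaluator =
                new TokenNameFinderEvaluator(new NameFinderME(model));
            evaluator.evaluate(gold);

            // Precision, recall and F-measure against the human annotations
            System.out.println(evaluator.getFMeasure());
        }
    }

I believe the same thing is also available on the command line via the
TokenNameFinderEvaluator tool.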


Christian Moen
アティリカ株式会社
http://www.atilika.com

On Jan 28, 2013, at 5:20 PM, Svetoslav Marinov <[email protected]> 
wrote:

> Wikipedia is not a good source for training.  I've tried that, but not all
> entities in a text are tagged.  Sometimes only the first occurrence of an
> entity is tagged and the rest are not, or only partially.  To me the tagging
> seemed so random that it doesn't meet any criteria for a good corpus.  And
> then comes the question of how to distinguish people from places from
> events or any other entities.
> 
> In my view, in order to use Wikipedia one would need to do a lot of extra
> processing before decent quality is achieved.
> 
> Svetoslav
> 
> 
> 
> On 2013-01-28 05:31, "Lance Norskog" <[email protected]> wrote:
> 
>> Yes. The Wikipedia XML has person/place/etc. tags in all of the article
>> text.
>> 
>> On 01/27/2013 08:15 PM, John Stewart wrote:
>>> Lance, could you say more?  Do you mean WP tagging as training data for
>>> the
>>> NER task?
>>> 
>>> Thanks,
>>> 
>>> jds
>>> 
>>> 
>>> On Sun, Jan 27, 2013 at 11:07 PM, Lance Norskog <[email protected]>
>>> wrote:
>>> 
>>>> The Wikipedia tagging should provide very good training sets. Has
>>>> anybody
>>>> tried using them?
>>>> 
>>>> 
>>>> On 01/25/2013 02:14 AM, Jörn Kottmann wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> well, the main problem with the models on SourceForge is that they were
>>>>> trained on news data from the 90s, so they do not perform very well on
>>>>> today's news articles or on out-of-domain data (anything else).
>>>>> 
>>>>> When I speak to our users here and there, I always get the impression
>>>>> that most people are still happy with the performance of the Tokenizer,
>>>>> Sentence Splitter and POS Tagger, while many are disappointed with the
>>>>> Name Finder models.  That said, the name finder works well when trained
>>>>> on your own data.
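>>>>> 
>>>>> In case it helps, here is a rough sketch of what training on your own
>>>>> data looks like with the 1.5.x API (the file names and the "person" type
>>>>> are just placeholders; the training file has one sentence per line with
>>>>> entities marked as <START:person> ... <END>):
>>>>> 
>>>>>     import java.io.BufferedOutputStream;
>>>>>     import java.io.FileInputStream;
>>>>>     import java.io.FileOutputStream;
>>>>>     import java.io.OutputStream;
>>>>>     import java.nio.charset.Charset;
>>>>>     import java.util.Collections;
>>>>> 
>>>>>     import opennlp.tools.namefind.NameFinderME;
>>>>>     import opennlp.tools.namefind.NameSample;
>>>>>     import opennlp.tools.namefind.NameSampleDataStream;
>>>>>     import opennlp.tools.namefind.TokenNameFinderModel;
>>>>>     import opennlp.tools.util.ObjectStream;
>>>>>     import opennlp.tools.util.PlainTextByLineStream;
>>>>> 
>>>>>     public class TrainNameFinder {
>>>>>         public static void main(String[] args) throws Exception {
>>>>>             // Training data in the name finder format (placeholder name)
>>>>>             ObjectStream<NameSample> samples = new NameSampleDataStream(
>>>>>                 new PlainTextByLineStream(
>>>>>                     new FileInputStream("my-ner.train"),
>>>>>                     Charset.forName("UTF-8")));
>>>>> 
>>>>>             TokenNameFinderModel model;
>>>>>             try {
>>>>>                 // 100 iterations and a cutoff of 5 are the usual defaults
>>>>>                 model = NameFinderME.train("en", "person", samples,
>>>>>                     Collections.<String, Object>emptyMap(), 100, 5);
>>>>>             } finally {
>>>>>                 samples.close();
>>>>>             }
>>>>> 
>>>>>             // Write the trained model to disk (placeholder name)
>>>>>             OutputStream modelOut = new BufferedOutputStream(
>>>>>                 new FileOutputStream("my-ner.bin"));
>>>>>             try {
>>>>>                 model.serialize(modelOut);
>>>>>             } finally {
>>>>>                 modelOut.close();
>>>>>             }
>>>>>         }
>>>>>     }
>>>>> 
>>>>> There is also the TokenNameFinderTrainer command-line tool if you prefer
>>>>> not to write code.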
>>>>> 
>>>>> Maybe the OntoNotes Corpus is something worth looking into.
>>>>> 
>>>>> The licensing is a gray area; you can probably get away with using the
>>>>> models in commercial software.  The corpus producers often restrict the
>>>>> usage of their corpus to research purposes only.  The question is whether
>>>>> they can also enforce these restrictive terms on statistical models built
>>>>> on the data, since the models probably don't violate the copyright.
>>>>> Sorry for not having a better answer; you probably need to ask a lawyer.
>>>>> 
>>>>> The evaluations in the documentation are often just samples to illustrate
>>>>> how to use the tools.  Have a look at the test plans in our wiki; we
>>>>> record the performance of OpenNLP there for every release we make.
>>>>> 
>>>>> The models are mostly trained with the default feature generation; have a
>>>>> look at the documentation and our code to get more details about it.  The
>>>>> features are not yet well documented, but a documentation patch to fix
>>>>> this would be very welcome!
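>>>>> 
>>>>> If memory serves, the default name finder feature generator corresponds
>>>>> roughly to the following combination from the
>>>>> opennlp.tools.util.featuregen package (a sketch for illustration, not
>>>>> necessarily what every model on SourceForge was built with):
>>>>> 
>>>>>     import opennlp.tools.util.featuregen.*;
>>>>> 
>>>>>     public class DefaultNameFeatures {
>>>>>         // Roughly the generator used when no custom one is supplied
>>>>>         public static AdaptiveFeatureGenerator create() {
>>>>>             return new CachedFeatureGenerator(new AdaptiveFeatureGenerator[] {
>>>>>                 // current token plus a window of two tokens on each side
>>>>>                 new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
>>>>>                 // token class (capitalisation, digits, ...) in the same window
>>>>>                 new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
>>>>>                 new OutcomePriorFeatureGenerator(),
>>>>>                 new PreviousMapFeatureGenerator(),
>>>>>                 new BigramNameFeatureGenerator(),
>>>>>                 new SentenceFeatureGenerator(true, false)
>>>>>             });
>>>>>         }
>>>>>     }
>>>>> 
>>>>> A custom generator can be passed in when training if you want to
>>>>> experiment with the features.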
>>>>> 
>>>>> HTH,
>>>>> Jörn
>>>>> 
>>>>> On 01/25/2013 10:36 AM, Christian Moen wrote:
>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I'm exploring the possibility of using OpenNLP in commercial software.
>>>>>> As part of this, I'd like to assess the quality of some of the models
>>>>>> available on http://opennlp.sourceforge.net/models-1.5/ and also learn
>>>>>> more about the applicable license terms.
>>>>>> 
>>>>>> My primary interest for now is in the English models for the Tokenizer,
>>>>>> Sentence Detector and POS Tagger.
>>>>>> 
>>>>>> The documentation at
>>>>>> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html
>>>>>> provides scores for various models as part of the evaluation run
>>>>>> examples.  Do these scores generally reflect those of the models on the
>>>>>> SourceForge download page?  Are further details on model quality, source
>>>>>> corpora, features used, etc. available?
>>>>>> 
>>>>>> I've seen posts to this list explain, as a general comment, that "the
>>>>>> models are subject to the licensing restrictions of the copyright
>>>>>> holders of the corpus used to train them."  I understand that the models
>>>>>> on SourceForge aren't part of any Apache OpenNLP release, but I'd very
>>>>>> much appreciate it if someone in the know could provide further insight
>>>>>> into the applicable licensing terms.  I'd be glad to be wrong about
>>>>>> this, but my understanding is that the models can't be used commercially.
>>>>>> 
>>>>>> Many thanks for any insight.
>>>>>> 
>>>>>> 
>>>>>> Christian
>>>>>> 
>>>>>> 
>>>>>> 
>> 
>> 
> 
> 
