Here is a reported experience where the author used DBpedia and
Wikipedia for this purpose [1].

[1] 
http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html

On Tue, Jan 29, 2013 at 4:43 PM, Christian Moen <[email protected]> wrote:
> Hello,
>
> We've done some experiments trying to synthesise a NER corpus from Wikipedia
> using various heuristics and link-structure analyses.  However, our models
> didn't turn out very well when scored against a gold standard tagged by
> humans.  I'm sure there are many improvements we could consider, but we
> didn't find pursuing this any further all that promising.  Basically, there
> were too many issues to address to make the corpus of good quality.  I
> believe academic research in the field has run into similar challenges.
> This was quite a fun little study, though.
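
A rough sketch of the link-based idea Christian describes, for anyone who
wants to experiment anyway. This is only an illustration, not his actual
pipeline: the typeOf map (link target -> entity type, e.g. derived from
DBpedia instance types) is an assumed input, and the <START:type> ... <END>
output is OpenNLP's name finder training format.

    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class WikiLinkAnnotator {

        // Matches [[Target]] or [[Target|anchor text]] in raw wikitext.
        private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]+))?\\]\\]");

        public static String annotate(String wikitext, Map<String, String> typeOf) {
            Matcher m = LINK.matcher(wikitext);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                String target = m.group(1).trim();
                String anchor = m.group(2) != null ? m.group(2) : target;
                String type = typeOf.get(target); // e.g. "person", "location"
                // Typed links become annotations; everything else is plain text.
                String replacement = type != null
                    ? "<START:" + type + "> " + anchor + " <END>"
                    : anchor;
                m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            }
            m.appendTail(out);
            return out.toString();
        }
    }

The hard part, as the messages below explain, is everything around this
step: untagged later mentions, noisy types, and inconsistent linking
conventions.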
>
>
> Christian Moen
> アティリカ株式会社
> http://www.atilika.com
>
> On Jan 28, 2013, at 5:20 PM, Svetoslav Marinov 
> <[email protected]> wrote:
>
>> Wikipedia is not a good source for training. I've tried that, but not all
>> entities in a text are tagged. Sometimes just the first occurrence of an
>> entity is tagged and the rest are not, or are tagged only partially. To me
>> the tagging seemed so random that it doesn't meet any criteria for a good
>> corpus. And then comes the question of how to distinguish people from
>> places, from events, or from any other entities.
>>
>> In my view, in order to use Wikipedia, one will need to do a lot of extra
>> processing before decent quality is achieved.
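
One hedged sketch of what such extra processing might look like: Wikipedia
typically links only the first mention of an entity, so a naive cleanup step
is to collect the tagged mentions and re-tag later verbatim occurrences of
the same surface form. The <START:type> ... <END> notation is OpenNLP's
training format; a real implementation would work on tokens rather than raw
strings, and would handle overlapping and nested names.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MentionPropagator {

        private static final Pattern TAGGED =
            Pattern.compile("<START:([a-z]+)> (.+?) <END>");

        public static String propagate(String annotatedText) {
            // First pass: remember surface form -> type for each tagged mention.
            Map<String, String> known = new LinkedHashMap<>();
            Matcher m = TAGGED.matcher(annotatedText);
            while (m.find()) {
                known.put(m.group(2), m.group(1));
            }
            // Second pass: tag untagged occurrences of each known surface form.
            // The lookbehind/lookahead guards skip already-tagged mentions.
            String result = annotatedText;
            for (Map.Entry<String, String> e : known.entrySet()) {
                String surface = Pattern.quote(e.getKey());
                result = result.replaceAll(
                    "(?<!<START:[a-z]{1,20}> )\\b" + surface + "\\b(?! <END>)",
                    Matcher.quoteReplacement(
                        "<START:" + e.getValue() + "> " + e.getKey() + " <END>"));
            }
            return result;
        }
    }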
>>
>> Svetoslav
>>
>>
>>
>> On 2013-01-28 05:31, "Lance Norskog" <[email protected]> wrote:
>>
>>> Yes. The Wikipedia XML has person/place/etc. tags in all of the article
>>> text.
>>>
>>> On 01/27/2013 08:15 PM, John Stewart wrote:
>>>> Lance, could you say more?  Do you mean WP tagging as training data for
>>>> the
>>>> NER task?
>>>>
>>>> Thanks,
>>>>
>>>> jds
>>>>
>>>>
>>>> On Sun, Jan 27, 2013 at 11:07 PM, Lance Norskog <[email protected]>
>>>> wrote:
>>>>
>>>>> The Wikipedia tagging should provide very good training sets. Has
>>>>> anybody
>>>>> tried using them?
>>>>>
>>>>>
>>>>> On 01/25/2013 02:14 AM, Jörn Kottmann wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> well, the main problem with the models on SourceForge is that they were
>>>>>> trained on news data from the 90s and do not perform very well on
>>>>>> today's news articles or on out-of-domain data (anything else).
>>>>>>
>>>>>> When I speak to our users here and there, I always get the impression
>>>>>> that most people are still happy with the performance of the Tokenizer,
>>>>>> Sentence Splitter and POS Tagger, but many are disappointed with the
>>>>>> Name Finder models. That said, the name finder works well if trained on
>>>>>> your own data.
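
For reference, training on your own data looks roughly like this with the
1.5-era API (adapted from the 1.5.2 manual; method signatures vary between
OpenNLP releases, and the file names here are placeholders). The training
file has one whitespace-tokenized sentence per line, with names marked as
<START:person> Pierre Vinken <END>.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.nio.charset.Charset;
    import java.util.Collections;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;
    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

    public class TrainNameFinder {
        public static void main(String[] args) throws Exception {
            ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("en-ner-person.train"),
                Charset.forName("UTF-8"));
            ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

            TokenNameFinderModel model;
            try {
                // A null feature generator means the default feature set is used.
                model = NameFinderME.train("en", "person", samples,
                    TrainingParameters.defaultParams(),
                    (AdaptiveFeatureGenerator) null,
                    Collections.<String, Object>emptyMap());
            } finally {
                samples.close();
            }

            model.serialize(new FileOutputStream("en-ner-person.bin"));
        }
    }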
>>>>>>
>>>>>> Maybe the OntoNotes Corpus is something worth looking into.
>>>>>>
>>>>>> The licensing is a gray area; you can probably get away with using the
>>>>>> models in commercial software. The corpus producers often restrict the
>>>>>> usage of their corpus to research purposes only. The question is
>>>>>> whether they can enforce these restrictive terms on statistical models
>>>>>> built on the data, since the models probably don't violate the
>>>>>> copyright. Sorry for not having a better answer; you probably need to
>>>>>> ask a lawyer.
>>>>>>
>>>>>> The evaluations in the documentation are often just samples to
>>>>>> illustrate how to use the tools. Have a look at the test plans in our
>>>>>> wiki, where we record the performance of OpenNLP for every release we
>>>>>> make.
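
If you want to score a model against your own held-out data rather than
rely on the documentation samples, something along these lines should work
with the 1.5-era API (a hedged sketch; check the evaluator classes in the
release you use):

    import java.io.IOException;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.TokenNameFinderEvaluator;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.eval.FMeasure;

    public class EvaluateNameFinder {

        // Scores a trained model against held-out samples in the same
        // <START:type> ... <END> format used for training.
        static void evaluate(TokenNameFinderModel model,
                             ObjectStream<NameSample> testSamples)
                throws IOException {
            TokenNameFinderEvaluator evaluator =
                new TokenNameFinderEvaluator(new NameFinderME(model));
            evaluator.evaluate(testSamples);
            FMeasure f = evaluator.getFMeasure();
            System.out.println("Precision: " + f.getPrecisionScore()
                + "  Recall: " + f.getRecallScore()
                + "  F1: " + f.getFMeasure());
        }
    }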
>>>>>>
>>>>>> The models are mostly trained with default feature generation; have a
>>>>>> look at the documentation and our code to get more details about it.
>>>>>> The features are not yet well documented, but a documentation patch to
>>>>>> fix this would be very welcome!
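
As a pointer, the feature set described in the 1.5 manual looks like the
snippet below; an AdaptiveFeatureGenerator built this way can be passed to
the NameFinderME.train(...) overload that accepts one (class names as in
OpenNLP 1.5; verify against your release).

    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
    import opennlp.tools.util.featuregen.BigramNameFeatureGenerator;
    import opennlp.tools.util.featuregen.CachedFeatureGenerator;
    import opennlp.tools.util.featuregen.OutcomePriorFeatureGenerator;
    import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
    import opennlp.tools.util.featuregen.SentenceFeatureGenerator;
    import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
    import opennlp.tools.util.featuregen.TokenFeatureGenerator;
    import opennlp.tools.util.featuregen.WindowFeatureGenerator;

    public class DefaultFeatures {
        // Current token and token class in a +/-2 window, outcome priors,
        // previously assigned outcomes, token bigrams, sentence begin/end.
        static AdaptiveFeatureGenerator defaultFeatures() {
            return new CachedFeatureGenerator(
                new AdaptiveFeatureGenerator[] {
                    new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
                    new WindowFeatureGenerator(
                        new TokenClassFeatureGenerator(true), 2, 2),
                    new OutcomePriorFeatureGenerator(),
                    new PreviousMapFeatureGenerator(),
                    new BigramNameFeatureGenerator(),
                    new SentenceFeatureGenerator(true, false)
                });
        }
    }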
>>>>>>
>>>>>> HTH,
>>>>>> Jörn
>>>>>>
>>>>>> On 01/25/2013 10:36 AM, Christian Moen wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm exploring the possibility of using OpenNLP in commercial
>>>>>>> software.
>>>>>>>  As part of this, I'd like to assess the quality of some of the
>>>>>>> models
>>>>>>> available on
>>>>>>> http://opennlp.sourceforge.net/models-1.5/ and also learn more about
>>>>>>> the applicable license terms.
>>>>>>>
>>>>>>> My primary interests for now are the English models for the Tokenizer,
>>>>>>> Sentence Detector and POS Tagger.
>>>>>>>
>>>>>>> The documentation on
>>>>>>> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html
>>>>>>> provides scores for various
>>>>>>> models as part of evaluation run examples.  Do
>>>>>>> these scores generally reflect those of the models on the SourceForge
>>>>>>> download page?  Are further details on model quality, source corpora,
>>>>>>> features used, etc. available?
>>>>>>>
>>>>>>> I've seen posts to this list explain, as a general comment, that "the
>>>>>>> models are subject to the licensing restrictions of the copyright
>>>>>>> holders of the corpus used to train them."  I understand that the
>>>>>>> models on SourceForge aren't part of any Apache OpenNLP release, but
>>>>>>> I'd very much appreciate it if someone in the know could provide
>>>>>>> further insight into the applicable licensing terms.  I'd be glad to
>>>>>>> be wrong about this, but my understanding is that the models can't be
>>>>>>> used commercially.
>>>>>>>
>>>>>>> Many thanks for any insight.
>>>>>>>
>>>>>>>
>>>>>>> Christian
>>>>>>>
>>>>>>>
>>>>>>>



-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
