A while back I started thinking about whether Wikinews could be
used as a training source for a community annotation
project over at OpenNLP. I guess your experience and your code
would be really helpful for transforming that data into a format
we could use for such a project. Over time we would pull in
new articles to keep up with new topics.

In that annotation project we could introduce the concept of
"atomic" annotations, i.e. annotations which are only considered
correct for a part of the article. Some named entity annotations could maybe
be created directly from the wiki markup with an approach similar to the one
you used, and more could be produced by the community.
I guess it is possible to give these partially available named entities to our name finder so that it automatically labels the rest of the article with higher precision than usual.
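
To make that a bit more concrete, here is a rough sketch of the idea
(the class name WikiMarkupPreLabeler and the simple [[...]] regex are
just illustrative assumptions, not anything that exists in OpenNLP today):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.Span;

    // Hypothetical helper: derive "atomic" annotations from wiki links
    // and let the statistical name finder label the rest of the article.
    public class WikiMarkupPreLabeler {

        // matches [[Target]] and [[Target|label]] wiki links
        private static final Pattern WIKI_LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|[^\\]]+)?\\]\\]");

        // entities that are explicitly linked in the markup
        public static List<String> linkedEntities(String wikiText) {
            List<String> entities = new ArrayList<String>();
            Matcher m = WIKI_LINK.matcher(wikiText);
            while (m.find()) {
                entities.add(m.group(1));
            }
            return entities;
        }

        // label the plain text of the article; the markup-derived entities
        // could then be used to verify or override the predicted spans
        public static Span[] label(NameFinderME nameFinder, String plainText) {
            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(plainText);
            Span[] names = nameFinder.find(tokens);
            nameFinder.clearAdaptiveData(); // reset after each document
            return names;
        }
    }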

After we have manually labeled a few hundred articles with entities we could even
go a step further and try to create new features for the name finder
which take the wiki markup into account (such a name finder could also help your
project to process the whole Wikipedia).
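
Such a feature could be plugged in through the AdaptiveFeatureGenerator
interface; a minimal sketch (the class name WikiLinkFeatureGenerator is
made up, and how the linked token indexes get computed is left open):

    import java.util.List;
    import java.util.Set;

    import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

    // Hypothetical feature generator: emits a feature for tokens that
    // were inside a [[...]] wiki link in the original markup.
    public class WikiLinkFeatureGenerator implements AdaptiveFeatureGenerator {

        private Set<Integer> linkedTokenIndexes;

        // to be set for the sentence that is about to be tagged
        public void setLinkedTokenIndexes(Set<Integer> linkedTokenIndexes) {
            this.linkedTokenIndexes = linkedTokenIndexes;
        }

        public void createFeatures(List<String> features, String[] tokens,
                int index, String[] previousOutcomes) {
            if (linkedTokenIndexes != null && linkedTokenIndexes.contains(index)) {
                features.add("wikilink");
            }
        }

        public void updateAdaptiveData(String[] tokens, String[] outcomes) {
            // no document-level state to update
        }

        public void clearAdaptiveData() {
            // nothing to clear
        }
    }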

If we start something like that it might only be useful for the tokenizer, sentence detector and name finder in the short term. Maybe over time it would even be possible to
add annotations for all the components we have in OpenNLP to this corpus.
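
In the short term the name finder part would not even need a special
format, the plain OpenNLP training data layout would do: one
whitespace-tokenized sentence per line with the entity spans marked, e.g.
(the sentences are just the usual illustrative example):

    <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
    Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .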

What do others think?

Jörn

On 1/13/11 6:06 PM, Olivier Grisel wrote:
2011/1/13 Jörn Kottmann <[email protected]>:
On 1/11/11 2:21 PM, Olivier Grisel wrote:
2011/1/4 Olivier Grisel <[email protected]>:
I plan to give more details in a blog post soon (tm).
Here it is:

http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html

It gives a bit more context and some additional results and clues for
improvements and potential new usages.

Now I have read this post too, it sounds very interesting.

What is the biggest training file for the name finder you can generate with
this method?
It depends on the class of the entity you are interested in and the
language of the dump. For instance, for the pair (person / French) I
have more than 600k sentences. For English it is going to be much bigger.
For entity classes such as "Drug" or "Protein" it is much lower (I
would say a couple of thousand sentences).

I trained my French models on my laptop with limited memory (2GB
allocated to the heap space), hence I stopped at ~100k sentences in the
training file to avoid GC thrashing. On Amazon EC2 instances with more
than 10GB RAM I guess you could train a model on 500k sentences and test it
on the remaining 100k sentences, for instance. At such scales, averaged
perceptron learners or SGD-based logistic regression models as
implemented in Apache Mahout would probably be faster to train than
the current MaxEnt impl.
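
(Just to illustrate what I mean by stopping at ~100k sentences: a tiny
wrapper around OpenNLP's ObjectStream could cap the number of samples;
the class name LimitedSampleStream is made up:)

    import java.io.IOException;

    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.util.ObjectStream;

    // Hypothetical wrapper that stops after a fixed number of samples,
    // e.g. ~100k sentences, so training stays inside a 2GB heap.
    public class LimitedSampleStream implements ObjectStream<NameSample> {

        private final ObjectStream<NameSample> samples;
        private final int limit;
        private int count = 0;

        public LimitedSampleStream(ObjectStream<NameSample> samples, int limit) {
            this.samples = samples;
            this.limit = limit;
        }

        public NameSample read() throws IOException {
            if (count >= limit) {
                return null; // signals end of stream once the limit is hit
            }
            count++;
            return samples.read();
        }

        public void reset() throws IOException {
            count = 0;
            samples.reset();
        }

        public void close() throws IOException {
            samples.close();
        }
    }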

I think we need MapReduce training support for OpenNLP. Actually that is
already on my todo list, but currently I am still busy with the Apache
migration and the next release.
Alright, no hurry. Please ping me as soon as you are ready to discuss this.

Anyway I hope we can get that done at least partially for the name finder
this year.
Great :)

