On 1/24/11 6:56 AM, Michael Migdol wrote:
Hi Everyone,
I hope you don't mind a few basic questions from someone who is
quite new to NLP. Hopefully they are coherent and relevant to the group.
I have been enjoying Olivier's fantastic work on pignlproc. I am initially
trying to create an OpenNLP model which can be used to extract
entities mentioned in news articles, starting with "countries". I have
successfully created a training set of about 10K annotated sentences
extracted from a 100MB chunk of Wikipedia using pignlproc, and have
created a country model from that training set using the default
OpenNLP CLI tools.
My first attempt to run the extractor with the new model was on the article at
http://hosted.ap.org/dynamic/stories/S/SOC_FOUR_NATIONS?SITE=MOSPL&SECTION=HOME&TEMPLATE=DEFAULT
. I ran the NameExtractor on the plain text of the article (after
sentence detection and tokenization). It successfully detected Sweden,
Canada, and Germany, but not the United States or China.
So, a few questions:
1) Olivier stated that the recall for English location entities was 0.64.
Does this mean that, in general, detecting only 3 of 5
countries mentioned in an article is about what one would expect?
There were actually 12 mentions in the article for the 5 distinct
countries (it found Sweden twice), so the recall for this simple test
was actually more like 42%. And obviously, a single article is not a
sufficient sample size to judge with. I know, my next task should be to
run the OpenNLP evaluator on a separate dataset, right?
Yes, just label a few sentences, maybe a few hundred, and then test on this
test sample.
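
If you want to let OpenNLP compute precision and recall on that
hand-labeled sample, something along these lines should work. This is
just a sketch: the class name and the file names (en-ner-country.bin,
country.test) are made up, and the exact PlainTextByLineStream
constructors differ a bit between OpenNLP versions.

import java.io.FileInputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderEvaluator;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class EvaluateCountryModel {

  public static void main(String[] args) throws Exception {
    // Load the country model you trained with the CLI tools
    TokenNameFinderModel model =
        new TokenNameFinderModel(new FileInputStream("en-ner-country.bin"));

    // The test file uses the same <START:country> ... <END> format
    // as the training data, one sentence per line
    ObjectStream<String> lines =
        new PlainTextByLineStream(new FileInputStream("country.test"), "UTF-8");
    ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

    // Runs the name finder over the test sentences and compares
    // its output against the reference annotations
    TokenNameFinderEvaluator evaluator =
        new TokenNameFinderEvaluator(new NameFinderME(model));
    evaluator.evaluate(samples);

    // Prints precision, recall and F-measure
    System.out.println(evaluator.getFMeasure());

    samples.close();
  }
}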
2) When creating the country Model, all of the sentences passed to the
trainer had countries in them. Am I supposed to be passing sentences
that do not contain countries as well?
I don't believe that this will have a big effect on your results. Maybe
you want a few non-country sentences as well, especially ones that contain
a country name which is not used as a country in that context, e.g. a
company name which includes a country name, such as IBM Germany.
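
To make that concrete, a made-up fragment in the name finder training
format could look like this; the second sentence is intentionally left
unlabeled, because Germany is part of a company name there:

<START:country> Sweden <END> defeated <START:country> Canada <END> in overtime .
The quarterly results were presented by IBM Germany at a press conference .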
3) The sentences passed to the trainer were not split up into document
groups. How much of an effect will this have on the results? Is
there a way to do this split using the existing pignlproc scripts?
In this case the previous map feature might not work as well as it would
if the sentences were split into documents. Maybe you could
create one document per Wikipedia article.
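
At detection time the counterpart is to reset the adaptive data between
documents. A rough sketch (the class and method names are made up; how
you group the tokenized sentences per article is up to you):

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class PerDocumentDetection {

  // documentSentences: the tokenized sentences of one news article
  public static void findCountries(TokenNameFinderModel model,
      String[][] documentSentences) {

    NameFinderME finder = new NameFinderME(model);

    for (String[] sentenceTokens : documentSentences) {
      // Adaptive features such as the previous map only see
      // outcomes from the current article
      Span[] countries = finder.find(sentenceTokens);

      for (Span country : countries) {
        System.out.println(country);
      }
    }

    // Forget the per-article state before processing the next one;
    // in the training data the same reset is triggered by an empty line
    finder.clearAdaptiveData();
  }
}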
4) Does anyone have some basic troubleshooting advice for trying to
understand why specific texts are not being successfully extracted?
A powerful way to better understand how things are detected is
to carefully inspect the training data. Does it contain countries which
are not labeled? How is the context of your test sample represented
in your training data? Just do a string search in it for specific mistakes.
Let's say it did not detect this one:
... after the Sweden game ...
then you might want to search for:
- Sweden <END> game
- <END> game
- after the <START
But maybe it detected these two correctly:
... China defeated Sweden ...
Then the other cases might have been detected correctly as well,
if the previous map feature were working correctly.
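
If the training file is too big to scan by hand, a tiny helper like this
can do the string search (the file name and the search string are just
examples):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class SearchTrainingData {

  public static void main(String[] args) throws Exception {
    // The context we are looking for, e.g. taken from a missed detection
    String needle = "Sweden <END> game";

    BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("country.train"), "UTF-8"));

    String line;
    int lineNumber = 0;
    while ((line = reader.readLine()) != null) {
      lineNumber++;
      if (line.contains(needle)) {
        System.out.println(lineNumber + ": " + line);
      }
    }

    reader.close();
  }
}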
5) Are there any customizations you would suggest making to the
Feature Generator?
We believe having proper dictionary support would be a great help:
https://issues.apache.org/jira/browse/OPENNLP-78
That would be especially helpful for detecting company names or person
names; it is maybe not that helpful for country names, because there are
not that many countries.
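
If you want to experiment with the feature generation, you can pass your
own AdaptiveFeatureGenerator into NameFinderME.train. A rough sketch of
the default combination is below (the class name is made up, the train
overload differs a bit between releases, and a dictionary-based generator
from OPENNLP-78 would simply be one more entry in the array):

import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
import opennlp.tools.util.featuregen.BigramNameFeatureGenerator;
import opennlp.tools.util.featuregen.CachedFeatureGenerator;
import opennlp.tools.util.featuregen.OutcomePriorFeatureGenerator;
import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
import opennlp.tools.util.featuregen.SentenceFeatureGenerator;
import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
import opennlp.tools.util.featuregen.TokenFeatureGenerator;
import opennlp.tools.util.featuregen.WindowFeatureGenerator;

public class CustomFeatureTraining {

  // samples: e.g. a NameSampleDataStream over the pignlproc output
  public static TokenNameFinderModel trainCountryModel(
      ObjectStream<NameSample> samples) throws Exception {

    // Same combination as the built-in default; a dictionary-based
    // generator would just be added to this array
    AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
        new AdaptiveFeatureGenerator[] {
            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
            new OutcomePriorFeatureGenerator(),
            new PreviousMapFeatureGenerator(),
            new BigramNameFeatureGenerator(),
            new SentenceFeatureGenerator(true, false)
        });

    // 100 iterations and a cutoff of 5 are the usual defaults
    return NameFinderME.train("en", "country", samples, featureGenerator,
        Collections.<String, Object> emptyMap(), 100, 5);
  }
}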
I personally believe that it would be nice to have a few manually labeled
Wikipedia articles which can be used to train a small model, and then
use the markup in the Wikipedia articles, together with that small model,
to support the Name Finder. I guess that might create training data with
higher precision and recall.
Hope that helps,
Jörn