Hi Everyone,

I hope you don't mind a few basic questions from someone who is
quite new to NLP.  Hopefully they are coherent and relevant to the group.

I have been enjoying Olivier's fantastic work on pignlproc.  My initial
goal is to build an OpenNLP model that can extract entities mentioned in
news articles, starting with countries.  I have successfully created a
training set of about 10K annotated sentences extracted from a 100MB
chunk of Wikipedia using pignlproc, and have trained a country model from
that training set using the default OpenNLP CLI tools.
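
For reference, this is roughly the programmatic equivalent of what I did.
It is only a sketch against the 1.5-style API, so the signatures may
differ in other versions; "country.train" is my pignlproc output and the
iteration/cutoff values are just the CLI defaults:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.nio.charset.Charset;
    import java.util.Collections;
    import opennlp.tools.namefind.*;
    import opennlp.tools.util.*;

    // one annotated sentence per line, <START:country> ... <END> markup
    ObjectStream<String> lines = new PlainTextByLineStream(
        new FileInputStream("country.train"), Charset.forName("UTF-8"));
    ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

    // 100 iterations and a feature cutoff of 5 are the CLI defaults
    TokenNameFinderModel model = NameFinderME.train(
        "en", "country", samples,
        Collections.<String, Object>emptyMap(), 100, 5);

    model.serialize(new FileOutputStream("en-ner-country.bin"));
    samples.close();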

My first test of the new model was on the article at:
http://hosted.ap.org/dynamic/stories/S/SOC_FOUR_NATIONS?SITE=MOSPL&SECTION=HOME&TEMPLATE=DEFAULT
I ran the NameExtractor on the plain text of the article (after sentence
detection and tokenization).  It successfully detected Sweden,
Canada, and Germany, but not the United States or China.
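
Concretely, my extraction pass looks roughly like the snippet below.
Here en-sent.bin and en-token.bin are the stock models from the OpenNLP
site, en-ner-country.bin is my trained model, and articleText is assumed
to hold the plain text of the article:

    import java.io.FileInputStream;
    import opennlp.tools.namefind.*;
    import opennlp.tools.sentdetect.*;
    import opennlp.tools.tokenize.*;
    import opennlp.tools.util.Span;

    SentenceDetectorME sentenceDetector = new SentenceDetectorME(
        new SentenceModel(new FileInputStream("en-sent.bin")));
    TokenizerME tokenizer = new TokenizerME(
        new TokenizerModel(new FileInputStream("en-token.bin")));
    NameFinderME nameFinder = new NameFinderME(
        new TokenNameFinderModel(new FileInputStream("en-ner-country.bin")));

    for (String sentence : sentenceDetector.sentDetect(articleText)) {
      String[] tokens = tokenizer.tokenize(sentence);
      Span[] spans = nameFinder.find(tokens);
      // convert the detected spans back to surface strings
      for (String name : Span.spansToStrings(spans, tokens)) {
        System.out.println(name);
      }
    }
    nameFinder.clearAdaptiveData();  // reset after each document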

So, a few questions:

1) Olivier reported a recall of 0.64 for English location entities.
Does this mean that, in general, detecting only 3 of 5 countries
mentioned in an article is about what one would expect?  There were
actually 12 mentions of the 5 distinct countries in the article (it
found Sweden twice), so the recall on this simple test was closer to
42%.  Obviously, a single article is not a sufficient sample to judge
by; I take it my next task should be to run the OpenNLP evaluator on a
held-out dataset, right?
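
If I've read the API correctly, the programmatic equivalent of that
evaluator is something like the following, reusing the model object from
the training snippet above, where "country.heldout" would be a file in
the same format as the training data (please correct me if I've
misunderstood):

    import java.io.FileInputStream;
    import java.nio.charset.Charset;
    import opennlp.tools.namefind.*;
    import opennlp.tools.util.PlainTextByLineStream;

    TokenNameFinderEvaluator evaluator =
        new TokenNameFinderEvaluator(new NameFinderME(model));
    evaluator.evaluate(new NameSampleDataStream(
        new PlainTextByLineStream(
            new FileInputStream("country.heldout"),
            Charset.forName("UTF-8"))));
    System.out.println(evaluator.getFMeasure());  // precision, recall, F1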

2) When creating the country model, all of the sentences passed to the
trainer contained countries.  Am I supposed to be passing in sentences
that do not contain countries as well?

3) The sentences passed to the trainer were not split into document
groups.  How much of an effect will this have on the results?  Is
there a way to do this split using the existing pignlproc scripts?
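
My understanding from the docs is that the trainer treats an empty line
in the training file as a document boundary (that is where the adaptive
data gets cleared during training), so the grouping would just mean
inserting blank lines between sentences from different articles, e.g.:

    Last year <START:country> Sweden <END> beat <START:country> Canada <END> in the final .
    The game was hosted by <START:country> Germany <END> .

    <START:country> China <END> did not qualify this time .

(The sentences above are made up; only the markup format matters.)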

4) Does anyone have basic troubleshooting advice for understanding why
specific texts are not being successfully extracted?  I've got OpenNLP
and Maxent up and running in my IDE, so I'm trying to work out the best
places to set breakpoints and inspect intermediate results.  Is this
usually your approach, or do you take more of a black-box approach?  In
a similar vein, do you know an easy way to view the feature vector for
a specific entry in the model?
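
For instance, is something like the following a sane way to dump the
features generated for each token?  It assumes the default context
generator and fakes the prior outcomes, so it may not match exactly
what the beam search feeds the model:

    import java.util.Arrays;
    import opennlp.tools.namefind.*;

    NameContextGenerator contextGenerator = new DefaultNameContextGenerator();

    String[] tokens = "Sweden beat Canada in the final .".split(" ");
    String[] priorDecisions = new String[tokens.length];

    for (int i = 0; i < tokens.length; i++) {
      System.out.println(tokens[i] + " -> " + Arrays.toString(
          contextGenerator.getContext(i, tokens, priorDecisions, null)));
      priorDecisions[i] = "other";  // pretend each token so far was tagged "other"
    }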

5) Are there any customizations you would suggest making to the
Feature Generator?
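
For context, my understanding from the docs is that the default
aggregation is roughly the one below, and that NameFinderME.train has an
overload taking a custom AdaptiveFeatureGenerator (which then also has
to be supplied when instantiating NameFinderME at runtime).  I'm
wondering which of these are worth tuning or adding to for a country
model:

    import opennlp.tools.util.featuregen.*;

    AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
        new AdaptiveFeatureGenerator[] {
            // tokens and token classes in a two-token window around the outcome
            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
            new OutcomePriorFeatureGenerator(),
            new PreviousMapFeatureGenerator(),
            new BigramNameFeatureGenerator(),
            new SentenceFeatureGenerator(true, false)
        });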

Thanks to Olivier for sharing his work with the semantic community!  I
appreciate any and all help you can provide.

Thanks in advance,
Michael
